Around Kolmogorov complexity:
basic notions and results 

Alexander Shen* 


Abstract 

Algorithmic information theory studies description complexity and randomness
and is now a well-known field of theoretical computer science and
mathematical logic. There are several textbooks and monographs devoted
to this theory (e.g., [1, 2]) where one can find a detailed exposition of
many difficult results as well as historical references. However, it seems that
a short survey of its basic notions and of the main results relating these
notions to each other is missing. This report attempts to fill this gap and
covers the basic notions of algorithmic information theory: Kolmogorov
complexity (plain, conditional, prefix), Solomonoff universal a priori
probability, notions of randomness (Martin-Löf randomness, Mises-Church
randomness), effective Hausdorff dimension. We prove their basic properties
(symmetry of information, connection between a priori probability and prefix
complexity, criterion of randomness in terms of complexity, complexity
characterization for effective dimension) and show some applications
(incompressibility method in computational complexity theory, incompleteness
theorems). It is based on the lecture notes of a course at Uppsala University
given by the author [6].


1 Compressing information 

Everybody is familiar with compressing/decompressing programs such as zip,
gzip, compress, arj, etc. A compressing program can be applied to an arbitrary
file and produces a “compressed version” of that file. If we are lucky, the
compressed version is much shorter than the original one. However, no
information is lost: the decompression program can be applied to the compressed
version to get the original file.¹

How is it possible? A compression program tries to find some regularities in
a file which allow it to give a description of the file that is shorter than
the file itself; the decompression program then reconstructs the file using
this description.
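As a small illustration of this idea, the following sketch uses Python's standard `zlib` module (our choice for the example; the text does not single out any particular compressor). It checks that decompression restores the input exactly, and compares a very regular input with pseudo-random bytes. The sample data and the helper name `compressed_size` are assumptions made only for this example.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed version of data."""
    return len(zlib.compress(data, 9))


# A highly regular file: the pattern "01" repeated 5000 times.
regular = b"01" * 5000

# A file without visible regularities: pseudo-random bytes (fixed seed).
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(10000))

# No information is lost: decompression restores the original exactly.
assert zlib.decompress(zlib.compress(regular)) == regular
assert zlib.decompress(zlib.compress(noisy)) == noisy

print(len(regular), compressed_size(regular))  # regular data shrinks a lot
print(len(noisy), compressed_size(noisy))      # random-looking data does not
```

The regular input has an obvious short description ("repeat 01 five thousand times"), which the compressor finds; the pseudo-random input offers no such regularity to exploit.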


2 Kolmogorov complexity 

The Kolmogorov complexity may be roughly described as “the compressed size”. 
However, there are some differences. Instead of files (byte sequences) we consider 
bit strings (sequences of zeros and ones). The principal difference is that in the 
framework of Kolmogorov complexity we have no compression algorithm and 
deal only with decompression algorithms. 

Here is the definition. Let $U$ be an algorithm whose inputs and outputs are
binary strings. Using $U$ as a decompression algorithm, we define the complexity
$C_U(x)$ of a binary string $x$ with respect to $U$ as follows:

$$C_U(x) = \min\{\,|y| : U(y) = x\,\}$$

(here $|y|$ denotes the length of a binary string $y$). In other words, the
complexity of $x$ is defined as the length of the shortest description of $x$
if each binary string $y$ is considered as a description of $U(y)$.

Let us stress that $U(y)$ may be undefined for some $y$, and there are no
restrictions on the time necessary to compute $U(y)$. Let us mention also that
for some $U$ and $x$ the set of descriptions in the definition of $C_U$ may be
empty; we assume that $\min(\varnothing) = +\infty$ in this case.
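The definition can be tried out on a toy example. The sketch below computes $C_U(x)$ by brute force for a small hand-made decompressor; the decompressor `toy_decompressor`, the function `complexity`, and the bound `max_len` are hypothetical names introduced for illustration. The search is cut off at `max_len`, so it only approximates the true minimum, which ranges over all descriptions; it also relies on the toy $U$ being defined quickly on every input, which is not true in general.

```python
import math
from itertools import product


def toy_decompressor(y: str):
    """A toy decompressor U (an illustrative assumption, far from optimal).

    '1' + w is decoded as w itself; '0' + w is decoded as w repeated twice.
    The empty description is treated as undefined (returns None).
    """
    if not y:
        return None
    return y[1:] if y[0] == "1" else y[1:] * 2


def complexity(decompress, x: str, max_len: int):
    """C_U(x), searching only descriptions of length at most max_len."""
    for length in range(max_len + 1):          # shortest descriptions first
        for bits in product("01", repeat=length):
            if decompress("".join(bits)) == x:
                return length
    return math.inf                            # min over the empty set


print(complexity(toy_decompressor, "0101", 6))  # description '001' has length 3
print(complexity(toy_decompressor, "011", 6))   # only '1011' works: length 4
```

The string `0101` has a regularity that this $U$ can exploit, so its complexity is smaller than its length; `011` has none, and its shortest description is one bit longer than the string itself.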


3 Optimal decompression algorithm 

The definition of $C_U$ depends on $U$. For the trivial decompression algorithm
$U(y) = y$ we have $C_U(x) = |x|$. One can try to find better decompression
algorithms, where “better” means “giving smaller complexities”. However, the
number of short descriptions is limited: there are fewer than $2^n$ strings of
length less than $n$. Therefore, for every fixed decompression algorithm the
number of strings whose complexity is less than $n$ does not exceed $2^n - 1$.
One may conclude that there is no “optimal” decompression algorithm because we
can assign short descriptions to some strings only by taking them away from
other strings. However, Kolmogorov made a simple but crucial observation: there
is an asymptotically optimal decompression algorithm.

¹ Imagine that a software company advertises a compressing program and claims
that this program can compress every sufficiently long file to at most 90% of
its original size. Why wouldn’t you buy this program?

Definition. An algorithm $U$ is asymptotically not worse than an algorithm $V$
if $C_U(x) \le C_V(x) + C$ for some constant $C$ and for all $x$.

Theorem 1. There exists a decompression algorithm $U$ which is asymptotically
not worse than any other algorithm $V$.

Such an algorithm is called asymptotically optimal. The complexity $C_U$ with
respect to an asymptotically optimal $U$ is called Kolmogorov complexity. The
Kolmogorov complexity of a string $x$ is denoted by $C(x)$. (We assume that some
asymptotically optimal decompression algorithm is fixed.) Of course, Kolmogorov
complexity is defined only up to an $O(1)$ additive term.

The complexity C (x) can be interpreted as the amount of information in x or 
the “compressed size” of x. 

4 The construction of optimal 
decompression algorithm 

The idea of the construction is used in the so-called “self-extracting archives”. 
Assume that we want to send a compressed version of some file to our friend, but 
we are not sure he has the decompression program. What to do? Of course, we 
can send the program together with the compressed file. Or we can append the 
compressed file to the end of the program and get an executable file which will 
be applied to its own contents during the execution (assuming that the operating
system allows appending arbitrary data to the end of an executable file).

The same simple trick is used to construct a universal decompression
algorithm $U$. Given an input string $x$, the algorithm $U$ scans $x$ from left
to right until it finds some program $p$ written in a fixed programming language
(say, Pascal) where programs are self-delimiting, so the end of the program can
be determined uniquely. Then the rest of $x$ is used as an input for $p$, and
$U(x)$ is defined as the output of $p$.

Why is $U$ (asymptotically) optimal? Consider another decompression
algorithm $V$. Let $v$ be a (Pascal) program which implements $V$. Then

$$C_U(x) \le C_V(x) + |v|$$

for an arbitrary string $x$. Indeed, if $y$ is a $V$-compressed version of $x$
(i.e., $V(y) = x$), then $vy$ is a $U$-compressed version of $x$ (i.e.,
$U(vy) = x$) and is only $|v|$ bits longer.
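The structure of this construction can be sketched on a drastically simplified model. Instead of a full programming language, the sketch below assumes a fixed finite table of decompressors, and a "program" is just an index $k$ into the table, written self-delimitingly in unary as $1^k0$. All names (`TABLE`, `encode_program`, `universal`) are hypothetical; the real construction works for every algorithm $V$, not only for the three listed here.

```python
def identity(y: str) -> str:
    return y


def double(y: str) -> str:
    return y + y


def flip(y: str) -> str:
    return y.translate(str.maketrans("01", "10"))


# Toy stand-in for "all programs of a fixed programming language".
TABLE = [identity, double, flip]


def encode_program(k: int) -> str:
    """Self-delimiting code of program number k: k ones followed by a zero."""
    return "1" * k + "0"


def universal(x: str):
    """Read a self-delimiting program from the front of x, run it on the rest."""
    k = 0
    while k < len(x) and x[k] == "1":
        k += 1
    if k >= len(x) or k >= len(TABLE):
        return None  # no complete program found, or unknown program: undefined
    return TABLE[k](x[k + 1:])


# If y is a V-description of x, then (program for V) + y is a U-description
# of x, longer by the length of the program only.
print(universal(encode_program(1) + "01"))  # runs `double` on "01"
```

In this toy model the overhead for the $k$-th decompressor is $k + 1$ bits, which plays the role of $|v|$ in the inequality above.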


5 Basic properties of Kolmogorov complexity 

Theorem 2. 

(a) $C(x) \le |x| + O(1)$.

(b) The number of $x$ such that $C(x) \le n$ is equal to $2^n$ up to a bounded
factor separated from zero.

(c) For every computable function $f$ there exists a constant $c$ such that

$$C(f(x)) \le C(x) + c$$

(for every $x$ such that $f(x)$ is defined).

(d) Assume that for each natural $n$ a finite set $V_n$ containing no more than
$2^n$ elements is given. Assume that the relation $x \in V_n$ is enumerable,
i.e., there is an algorithm which produces the (possibly infinite) list of all
pairs $(x, n)$ such that $x \in V_n$. Then there is a constant $c$ such that all
elements of $V_n$ have complexity at most $n + c$ (for every $n$).

(e) The “typical” binary string of length $n$ has complexity close to $n$: there
exists a constant $c$ such that for every $n$ more than 99% of all strings of
length $n$ have complexity between $n - c$ and $n + c$.

Proof. (a) The asymptotically optimal decompression algorithm $U$ is not worse
than the trivial decompression algorithm $V(y) = y$.

(b) The number of such $x$ does not exceed the number of their compressed
versions, which is limited by the number of all binary strings of length not
exceeding $n$, which is bounded by $2^{n+1}$. On the other hand, the number of
$x$ such that $C(x) \le n$ is not less than $2^{n-c}$ (here $c$ is the constant
from (a)), because all strings of length $n - c$ have complexity not exceeding
$n$.

(c) Let $U$ be the optimal decompression algorithm used in the definition of
$C$. Compare $U$ with the decompression algorithm $V\colon y \mapsto f(U(y))$:

$$C_U(f(x)) \le C_V(f(x)) + O(1) \le C_U(x) + O(1)$$

(each $U$-compressed version of $x$ is a $V$-compressed version of $f(x)$).

(d) We allocate strings of length $n$ to be compressed versions of strings in
$V_n$ (when a new element of $V_n$ appears during the enumeration, the first
unused string of length $n$ is allocated). This procedure provides a
decompression algorithm $W$ such that $C_W(x) \le n$ for every $x \in V_n$.

(e) According to (a), all strings of length $n$ have complexity not exceeding
$n + c$ for some $c$. It remains to mention that the number of strings whose
complexity is less than $n - c$ does not exceed the number of all their
descriptions, i.e., strings of length less than $n - c$, and there are fewer
than $2^{n-c}$ of those. Therefore, for $c = 7$ (and for every larger $c$) the
fraction of strings having complexity less than $n - c$ among all the strings
of length $n$ is less than $2^{-7} < 1\%$.

□ 
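The counting in part (e) is plain arithmetic and can be checked directly. The sketch below (the function name `compressible_fraction_bound` is our own) computes the exact ratio between the number of descriptions shorter than $n - c$ and the number of strings of length $n$, using exact fractions to avoid rounding.

```python
from fractions import Fraction


def compressible_fraction_bound(n: int, c: int) -> Fraction:
    """Upper bound on the fraction of length-n strings with complexity < n - c."""
    # Descriptions shorter than n - c: 1 + 2 + ... + 2^(n-c-1) = 2^(n-c) - 1.
    descriptions = 2 ** (n - c) - 1 if n >= c else 0
    return Fraction(descriptions, 2 ** n)


for n in (10, 20, 100):
    # With c = 7 the bound is below 2^-7 = 1/128, hence below 1%.
    assert compressible_fraction_bound(n, 7) < Fraction(1, 128) < Fraction(1, 100)

print(float(compressible_fraction_bound(20, 7)))
print(float(compressible_fraction_bound(20, 6)))  # c = 6 would not give 1%
```

The bound does not depend on which decompression algorithm is used; it only counts how many short descriptions exist.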


Problems 

1. A decompression algorithm $D$ is chosen in such a way that $C_D(x)$ is even
for every string $x$. Could $D$ be optimal?

2. The same question if $C_D(x)$ is a power of 2 for every $x$.

3. Let $D$ be the optimal decompression algorithm. Does it guarantee that
$D(D(x))$ is also an optimal decompression algorithm?

4. Let $D_1, D_2, \ldots$ be a computable sequence of decompression algorithms.
Prove that $C(x) \le C_{D_i}(x) + 2\log i + O(1)$ for all $i$ and $x$ (the
constant in $O(1)$ does not depend on $x$ and $i$).

5.* Is it true that $C(xy) \le C(x) + C(y) + O(1)$ for all $x$ and $y$?


6 Algorithmic properties of C 

Theorem 3. The complexity function $C$ is not computable; moreover, every
computable lower bound for $C$ is bounded from above.

Proof. Assume that some partial function $g$ is a computable lower bound for
$C$, and $g$ is not bounded from above. Then for every $m$ we can effectively
find a string $x$ such that $C(x) > m$ (indeed, we should compute in parallel
$g(x)$ for all strings $x$ until we find a string $x$ such that $g(x) > m$).
Now consider the function

$$f(m) = \text{the first string } x \text{ such that } g(x) > m.$$

Here “first” means “first discovered” and $m$ is a natural number written in
binary notation; by our assumption, such $x$ always exists, so $f$ is a total
computable function. By construction, $C(f(m)) > m$; on the other hand,
$C(f(m)) \le C(m) + O(1)$. But $C(m) \le |m| + O(1)$, so we conclude that
$m \le |m| + O(1)$, which is impossible (the left-hand side is a natural
number, and the right-hand side is the length of its binary representation
plus a constant). □

This proof is a formal version of the well-known Berry paradox about “the 
smallest natural number which cannot be defined by twelve English words” (the 
quoted sentence defines this number and contains exactly twelve words). 

The non-computability of $C$ implies that any optimal decompression algorithm
$U$ is not everywhere defined (otherwise $C_U$ would be computable). It sounds
like a paradox: if $U(x)$ is undefined for some $x$ we can extend $U$ on $x$ and
let $U(x) = y$ for some $y$ of large complexity; after that $C_U(y)$ becomes
smaller (and all other values of $C$ do not change). However, it can be done for
one $x$ or for a finite number of $x$’s, but we cannot make $U$ defined
everywhere and keep $U$ optimal at the same time.


7 Complexity and incompleteness 

The argument used in the proof of the last theorem may be used to obtain an
interesting version of Gödel’s first incompleteness theorem. This application
of complexity theory was invented and advertised by G. Chaitin.

Consider a formal theory (like formal arithmetic or formal set theory). It may 
be represented as a (non-terminating) algorithm which generates statements of 
some fixed formal language; generated statements are called theorems. Assume 
that the language is rich enough to contain statements saying that “complexity
of 010100010 is bigger than 765” (for every bit string and every natural
number). The language of formal arithmetic satisfies this condition, as does
the language of formal set theory. Let us assume also that all theorems of the
considered theory are true. 

Theorem 4. There exists a constant $c$ such that all the theorems of type
“$C(x) > n$” have $n < c$.

Proof. Indeed, assume that it is not true. Consider the following algorithm
$\alpha$: for a given integer $k$, generate all the theorems and look for a
theorem of type $C(x) > s$ for some $x$ and some $s$ greater than $k$. When such
a theorem is found, $x$ becomes the output $\alpha(k)$ of the algorithm. By our
assumption, $\alpha(k)$ is defined for all $k$.

All theorems are supposed to be true, therefore $\alpha(k)$ is a bit string
whose complexity is bigger than $k$. As we have seen, this is impossible, since
$C(\alpha(k)) \le C(k) + O(1) \le |k| + O(1)$ where $|k|$ is the length of the
binary representation of $k$. □

(We may also use the statement of the preceding theorem instead of repeating 
the proof.) 

This result implies the classical Gödel theorem (it says that there are true
unprovable statements), since there exist strings of arbitrarily high
complexity.

A constant $c$ (in the theorem) can be found explicitly if we fix a formal
theory and the optimal decompression algorithm, and for most natural choices it
does not exceed (to give a rough estimate) 100,000. It leads to a paradoxical
situation: toss a coin $10^6$ times and write down the bit string of length
1,000,000. Then with overwhelming probability its complexity will be bigger
than 100,000, but this claim will be unprovable in formal arithmetic or set
theory.


8 Algorithmic properties of C (continued) 

Theorem 5. The function $C(x)$ is upper semicomputable, i.e., $C(x)$ can be
represented as $\lim_{n \to \infty} k(x, n)$ where $k(x, n)$ is a total
computable function with integer values and

$$k(x, 0) \ge k(x, 1) \ge k(x, 2) \ge \ldots$$

Note that all values are integers, so for every $x$ there exists some $N$ such
that $k(x, n) = C(x)$ for all $n > N$.

Sometimes upper semicomputable functions are called enumerable from above. 

Proof. Let $k(x, n)$ be the complexity of $x$ if we restrict by $n$ the
computation time used for decompression. In other words, let $U$ be the optimal
decompression algorithm used in the definition of $C$. Then $k(x, n)$ is the
minimal $|y|$ for all $y$ such that $U(y) = x$ and the computation time for
$U(y)$ does not exceed $n$. □
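The time-bounded approximation from this proof can be sketched on a toy model. We assume a hand-made decompressor `toy_U` that reports, together with its output, how many steps the computation took (for a real $U$ one would instead simulate it for $n$ steps). When no description is decoded in time, the sketch returns $|x| + c$, the same convention this section adopts for small $n$; here $c = 1$ works because of how `toy_U` is defined. All names and the step counts are assumptions for illustration.

```python
from itertools import product

FALLBACK_C = 1  # for this toy U, '1' + x always describes x, so C_U(x) <= |x| + 1


def toy_U(y: str):
    """Toy decompressor returning (output, steps), or None if undefined."""
    if not y:
        return None
    if y[0] == "1":
        return y[1:], len(y)        # copy mode: fast
    return y[1:] * 2, 10 * len(y)   # doubling mode: we pretend it is slow


def k(x: str, n: int) -> int:
    """Length of the shortest description of x decoded within n steps."""
    fallback = len(x) + FALLBACK_C
    for length in range(fallback):  # only shorter descriptions can improve
        for bits in product("01", repeat=length):
            result = toy_U("".join(bits))
            if result is not None and result[1] <= n and result[0] == x:
                return length
    return fallback


# The approximations can only decrease as the time bound grows.
print([k("0101", n) for n in (0, 10, 29, 30, 100)])
```

For `0101` the short description `001` needs 30 steps in this model, so the value of `k` drops from 5 to 3 once the time bound reaches 30 and stays there.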

(Technical correction: it can happen (for small $n$) that our definition gives
$k(x, n) = \infty$. In this case we let $k(x, n) = |x| + c$ where $c$ is chosen
in such a way that $C(x) \le |x| + c$ for all $x$.)


9 An encodings-free definition of complexity

The following theorem provides an “encodings-free” definition of Kolmogorov
complexity as a minimal function $K$ such that $K$ is upper semicomputable and

$$|\{x \mid K(x) < n\}| = O(2^n).$$

Theorem 6. Let $K(x)$ be an upper semicomputable function such that
$|\{x \mid K(x) < n\}| \le M \cdot 2^n$ for some constant $M$ and for all $n$.
Then there exists a constant $c$ such that $C(x) \le K(x) + c$ for all $x$.

Proof. This theorem is a reformulation of one of the statements above. Let
$V_n$ be the set of all strings $x$ such that $K(x) < n$. The binary relation
$x \in V_n$ (between $x$ and $n$) is enumerable. Indeed,
$K(x) = \lim_m k(x, m)$ where $k$ is a total computable function that is
decreasing as a function of $m$. Compute $k(x, m)$ for all $x$ and $m$ in
parallel. If it happens that $k(x, m) < n$ for some $x$ and $m$, add $x$ into
the enumeration of $V_n$. (The monotonicity of $k$ guarantees that in this case
$K(x) < n$.) Since $\lim_m k(x, m) = K(x)$, every element of $V_n$ will
ultimately appear.

By our assumption $|V_n| \le M \cdot 2^n$. Therefore we can allocate strings of
length $n + c$ (where $c = \lceil \log_2 M \rceil$) as descriptions of elements
of $V_n$ and will not run out of descriptions. In this way we get a
decompression algorithm $D$ such that $C_D(x) \le n + c$ for $x \in V_n$. Since
$K(x) < n$ implies $C_D(x) \le n + c$ for all $x$ and $n$, we have
$C_D(x) \le K(x) + 1 + c$ and $C(x) \le K(x) + c$ for some other $c$ and all
$x$. □

10 Axioms of complexity 

It would be nice to have a list of “axioms” for Kolmogorov complexity that
determine it uniquely (up to a bounded additive term). The following list shows
one of the possibilities.

• A1 (Conservation of information) For every computable (partial) function
$f$ there exists a constant $c$ such that $K(f(x)) \le K(x) + c$ for all $x$
such that $f(x)$ is defined.

• A2 (Enumerability from above) Function $K$ is enumerable from above.

• A3 (Calibration) There are constants $c$ and $C$ such that the cardinality of
the set $\{x \mid K(x) < n\}$ is between $c \cdot 2^n$ and $C \cdot 2^n$.

Theorem 7. Every function $K$ that satisfies A1-A3 differs from $C$ only by an
$O(1)$ additive term.

Proof. Axioms A2 and A3 guarantee that $C(x) \le K(x) + O(1)$. We need to prove
that $K(x) \le C(x) + O(1)$.

First, we prove that $K(x) \le |x| + O(1)$.

Since $K$ is enumerable from above, we can generate strings $x$ such that
$K(x) < n$. Axiom A3 guarantees that we have at least $2^{n-d}$ strings with
this property for some $d$ (which we assume to be an integer). Let us stop
generating them when we have already $2^{n-d}$ strings $x$ such that
$K(x) < n$; let $S_n$ be the set of strings generated in this way. The list of
all elements in $S_n$ can be obtained by an algorithm that has $n$ as input;
$|S_n| = 2^{n-d}$ and $K(x) < n$ for each $x \in S_n$.

We may assume that $S_1 \subset S_2 \subset S_3 \subset \ldots$ (if not, replace
some elements of $S_i$ by elements of $S_{i-1}$ etc.). Let $T_i$ be equal to
$S_{i+1} \setminus S_i$. Then $T_i$ has $2^{i-d}$ elements and all $T_i$ are
disjoint.

Now consider a computable function $f$ that maps elements of $T_n$ onto strings
of length $n - d$. Axiom A1 guarantees then that $K(x) \le n + O(1)$ for every
string $x$ of length $n - d$. Therefore, $K(x) \le |x| + O(1)$ for all $x$.

Let $D$ be the optimal decompression algorithm from the definition of $C$. We
apply A1 to the function $D$. If $p$ is a shortest description for $x$, then
$D(p) = x$, therefore
$K(x) = K(D(p)) \le K(p) + O(1) \le |p| + O(1) = C(x) + O(1)$.

□ 


Problems 

1. If $f\colon \mathbb{N} \to \mathbb{N}$ is a computable bijection, then
$C(f(x)) = C(x) + O(1)$. Is it true if $f$ is a (computable) injection (i.e.,
$f(x) \ne f(y)$ for $x \ne y$)? Is it true if $f$ is a surjection (for every
$y$ there is some $x$ such that $f(x) = y$)?

2. Prove that $C(x)$ is “continuous” in the following sense:
$C(x0) = C(x) + O(1)$ and $C(x1) = C(x) + O(1)$.

3. Is it true that $C(x)$ changes at most by a constant if we change the first
bit in $x$? the last bit in $x$? some bit in $x$?

4. Prove that $C(\bar{x}\,01\,\mathrm{bin}(C(x)))$ (a string $x$ with doubled
bits, denoted $\bar{x}$, is concatenated with 01 and the binary representation
of its complexity $C(x)$) equals $C(x) + O(1)$.


11 Complexity of pairs 

Let $(x, y) \mapsto [x, y]$ be a computable function that maps pairs of strings
into strings and is an injection (i.e., $[x, y] \ne [x', y']$ if $x \ne x'$ or
$y \ne y'$). We define the complexity $C(x, y)$ of a pair of strings as
$C([x, y])$.

Note that $C(x, y)$ changes only by an $O(1)$-term if we consider another
computable “pairing function”: if $[x, y]_1$ and $[x, y]_2$ are two pairing
functions, then $[x, y]_1$ can be obtained from $[x, y]_2$ by an algorithm, so
$C([x, y]_1) \le C([x, y]_2) + O(1)$.

Note that

$$C(x, y) \ge C(x) \quad\text{and}\quad C(x, y) \ge C(y)$$

up to an $O(1)$ additive term (indeed, there are computable functions that
produce $x$ and $y$ from $[x, y]$).

For similar reasons, $C(x, y) = C(y, x)$ and $C(x, x) = C(x)$, again up to
$O(1)$.

We can define $C(x, y, z)$, $C(x, y, z, t)$ etc. in a similar way:
$C(x, y, z) = C([x, [y, z]])$ (or $C(x, y, z) = C([[x, y], z])$; the difference
is $O(1)$).

Theorem 8.

$$C(x, y) \le C(x) + 2\log C(x) + C(y) + O(1).$$

Proof. By $\bar{x}$ we denote the binary string $x$ with all bits doubled. Let
$D$ be the optimal decompression algorithm. Consider the following
decompression algorithm $D_2$:

$$\overline{\mathrm{bin}(|p|)}\,01\,pq \mapsto [D(p), D(q)].$$

Note that $D_2$ is well defined, because the input string
$\overline{\mathrm{bin}(|p|)}\,01\,pq$ can be disassembled into parts uniquely:
we know where 01 is, so we can find $|p|$ and then separate $p$ and $q$.

If $p$ is the shortest description for $x$ and $q$ is the shortest description
for $y$, then $D(p) = x$, $D(q) = y$ and
$D_2(\overline{\mathrm{bin}(|p|)}\,01\,pq) = [x, y]$. Therefore

$$C_{D_2}([x, y]) \le |p| + 2\log|p| + |q| + O(1);$$

here $|p| = C(x)$ and $|q| = C(y)$ by our assumption.


□ 


Of course, $p$ and $q$ can be exchanged: we can replace $\log C(x)$ by
$\log C(y)$.
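The "doubled length, then 01, then both strings" encoding used in this proof is easy to make concrete. The sketch below only packs and unpacks the two descriptions $p$ and $q$ (the decompressor $D_2$ would then apply $D$ to each of them); the function names are our own.

```python
def double_bits(s: str) -> str:
    """The string s with every bit doubled (written with a bar in the text)."""
    return "".join(b + b for b in s)


def encode_pair(p: str, q: str) -> str:
    """Doubled binary length of p, then the separator 01, then p and q."""
    return double_bits(format(len(p), "b")) + "01" + p + q


def decode_pair(code: str):
    """Inverse of encode_pair: recover (p, q) from the combined string."""
    i = 0
    length_bits = []
    while code[i:i + 2] != "01":   # doubled bits are 00 or 11, never 01
        length_bits.append(code[i])
        i += 2
    n = int("".join(length_bits), 2)
    body = code[i + 2:]
    return body[:n], body[n:]


code = encode_pair("101", "0")
print(code, decode_pair(code))
```

The overhead is $2\,|\mathrm{bin}(|p|)| + 2$ bits, which gives the term $2\log|p| + O(1)$ in the inequality.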


12 Conditional complexity 

We now want to define the conditional complexity of $x$ when $y$ is known.
Imagine that you want to send string $x$ to your friend using as few bits as
possible. If she already knows some string $y$ which is similar to $x$, this
can be used to make the message shorter.

Here is the definition. Let $(p, y) \mapsto D(p, y)$ be a computable function
of two arguments. We define the conditional complexity $C_D(x|y)$ of $x$ when
$y$ is known as

$$C_D(x|y) = \min\{\,|p| : D(p, y) = x\,\}.$$

As usual, $\min(\varnothing) = +\infty$. The function $D$ is called
“conditional decompressor” or “conditional description mode”: $p$ is the
description (compressed version) of $x$ when $y$ is known. (To get $x$ from $p$
the decompressing algorithm $D$ needs $y$.)

Theorem 9. There exists an optimal conditional decompressing function $D$ such
that for every other conditional decompressing function $D'$ there exists a
constant $c$ such that

$$C_D(x|y) \le C_{D'}(x|y) + c$$

for all strings $x$ and $y$.

Proof. As for the non-conditional version, consider some programming language
where programs allow two input strings and are self-delimiting. Then let

$$D(uv, y) = \text{the output of program } u \text{ applied to } v, y.$$

Algorithm $D$ finds a (self-delimiting) program $u$ as a prefix of its first
argument and then applies $u$ to the rest of the first argument and the second
argument.

Let $D'$ be some other conditional decompressing function. Being computable,
it has some program $u$. Then

$$C_D(x|y) \le C_{D'}(x|y) + |u|.$$

Indeed, let $p$ be the shortest string such that $D'(p, y) = x$ (therefore,
$|p| = C_{D'}(x|y)$). Then $D(up, y) = x$, therefore
$C_D(x|y) \le |up| = |p| + |u| = C_{D'}(x|y) + |u|$. □

We fix some optimal conditional decompressing function $D$ and omit the index
$D$ in $C_D(x|y)$. Beware that $C(x|y)$ is defined only “up to $O(1)$-term”.

Theorem 10.

(a) $C(x|y) \le C(x) + O(1)$.

(b) For every $y$ there exists some constant $c$ such that

$$|C(x) - C(x|y)| \le c$$

for all $x$.

This theorem says that conditional complexity is smaller than the
unconditional one, but for every fixed condition the difference is bounded by a
constant (depending on the condition).

Proof. (a) If $D_0$ is an (unconditional) decompressing algorithm, we can
consider a conditional decompressing algorithm

$$D(p, y) = D_0(p)$$

that ignores conditions. Then $C_D(x|y) = C_{D_0}(x)$.

(b) On the other hand, if $D$ is a conditional decompressing algorithm, for
every fixed $y$ we may consider an (unconditional) decompressing algorithm
$D_y$ defined as

$$D_y(p) = D(p, y).$$

Then $C_{D_y}(x) = C_D(x|y)$ for the given $y$ and for all $x$. And
$C(x) \le C_{D_y}(x) + O(1)$ (where the $O(1)$-constant depends on $y$). □

13 Pair complexity and conditional complexity 

Theorem 11.

$$C(x, y) = C(x|y) + C(y) + O(\log C(x) + \log C(y)).$$

Proof. Let us prove first that

$$C(x, y) \le C(x|y) + C(y) + O(\log C(x) + \log C(y)).$$

We do it as before: if $D$ is an optimal decompressing function (for
unconditional complexity) and $D_2$ is an optimal conditional decompressing
function, let

$$D'(\overline{\mathrm{bin}(|p|)}\,01\,pq) = [D_2(p, D(q)), D(q)]$$

(the bar denotes doubling every bit, as in the proof of Theorem 8). In other
terms, to get the description of the pair $x, y$ we concatenate the shortest
description of $y$ (denoted by $q$) with the shortest description of $x$ when
$y$ is known (denoted by $p$). (Special precautions are used to guarantee the
unique decomposition.) Indeed, in this case $D(q) = y$ and
$D_2(p, D(q)) = D_2(p, y) = x$, therefore

$$C_{D'}([x, y]) \le |p| + 2\log|p| + |q| + O(1) \le
C(x|y) + C(y) + O(\log C(x) + \log C(y)).$$

The reverse inequality is much more interesting. Let us explain the idea of the
proof. This inequality is a translation of a simple combinatorial statement.
Let $A$ be a finite set of pairs of strings. By $|A|$ we denote the cardinality
of $A$. For each string $y$ we consider the set $A_y$ defined as

$$A_y = \{x \mid (x, y) \in A\}.$$

The cardinality $|A_y|$ depends on $y$ (and is equal to 0 for all $y$ outside
some finite set). Evidently,

$$\sum_y |A_y| = |A|.$$

Therefore, the number of $y$ such that $|A_y|$ is big, is limited:

$$|\{y : |A_y| \ge c\}| \le |A| / c$$

for each $c$.
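This combinatorial statement can be checked numerically on a small example. The sketch below builds a random finite set $A$ of pairs (integers are used instead of strings for brevity; the sizes and the seed are arbitrary assumptions) and verifies both the sum formula and the bound on the number of "big" sections.

```python
import random
from collections import Counter


def section_sizes(pairs):
    """Map each y to |A_y|, the number of x with (x, y) in the set."""
    return Counter(y for _x, y in pairs)


def count_big_sections(pairs, c):
    """Number of y whose section A_y has at least c elements."""
    return sum(1 for size in section_sizes(pairs).values() if size >= c)


rng = random.Random(1)
A = {(rng.randrange(50), rng.randrange(8)) for _ in range(200)}

# The sections partition A, so their sizes add up to |A|.
assert sum(section_sizes(A).values()) == len(A)

# At most |A| / c values of y can have a section of size c or more.
for c in range(1, 60):
    assert count_big_sections(A, c) * c <= len(A)
```

The second assertion is just the pigeonhole argument from the text: if more than $|A|/c$ sections had $c$ or more elements each, their total size would exceed $|A|$.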

Now we return to complexities. Let $x$ and $y$ be two strings. The inequality
$C(x|y) + C(y) \le C(x, y) + O(\log C(x) + \log C(y))$ can be informally read
as follows: if $C(x, y) < m + n$, then either $C(x|y) < m$ or $C(y) < n$ up to
logarithmic terms. Why is it the case? Consider the set $A$ of all pairs
$(x, y)$ such that $C(x, y) < m + n$. There are at most $2^{m+n}$ pairs in $A$.
The given pair $(x, y)$ belongs to $A$. Consider the set $A_y$. It is either
“small” (contains at most $2^m$ elements) or “big” (= not small). If $A_y$ is
small ($|A_y| \le 2^m$), then $x$ can be described (when $y$ is known) by its
ordinal number in $A_y$, which requires $m$ bits, and $C(x|y)$ does not exceed
$m$ (plus some administrative overhead). If $A_y$ is big, then $y$ belongs to a
(rather small) set $Y$ of all strings $y$ such that $A_y$ is big. The number of
strings $y$ such that $|A_y| > 2^m$ does not exceed $|A| / 2^m \le 2^n$.
Therefore, $y$ can be (unconditionally) described by its ordinal number in $Y$,
which requires $n$ bits (plus overhead of logarithmic size).

Let us repeat this more formally. Let $C(x, y) = a$. Consider the set $A$ of all
pairs $(x, y)$ that have complexity at most $a$. Let
$b = \lfloor \log_2 |A_y| \rfloor$. To describe $x$ when $y$ is known we need to
specify $a$, $b$ and the ordinal number of $x$ in $A_y$ (this set can be
enumerated effectively if $a$ and $b$ are known since $C$ is enumerable from
above). This ordinal number has $b + O(1)$ bits and, therefore,
$C(x|y) \le b + O(\log a + \log b)$.

On the other hand, the set of all $y'$ such that $|A_{y'}| \ge 2^b$ consists of
at most $|A| / 2^b = O(2^{a-b})$ elements and can be enumerated when $a$ and $b$
are known. Our $y$ belongs to this set, therefore, $y$ can be described by $a$,
$b$ and $y$’s ordinal number, and $C(y) \le a - b + O(\log a + \log b)$.
Therefore, $C(y) + C(x|y) \le a + O(\log a + \log b)$.

□ 


Problems 

1. Define $C(x, y, z)$ as $C([[x, y], [x, z]])$. Is this definition equivalent
to the standard one (up to an $O(1)$-term)?

2. Prove that $C(x, y) \le C(x) + \log C(x) + 2\log\log C(x) + C(y) + O(1)$.
(Hint: repeat the trick with encoded length.)

3. Let $f$ be a computable function of two arguments. Prove that
$C(f(x, y)|y) \le C(x|y) + O(1)$ where the $O(1)$-constant depends on $f$ but
not on $x$ and $y$.

4*. Prove that $C(x|C(x)) = C(x) + O(1)$.


14 Applications of conditional complexity 

Theorem 12. If $x$, $y$, $z$ are strings of length at most $n$, then

$$2C(x, y, z) \le C(x, y) + C(x, z) + C(y, z) + O(\log n).$$

Proof. The statement does not mention conditional complexity; however, the
proof uses it. Recall that (up to $O(\log n)$-terms) we have

$$C(x, y, z) - C(x, y) = C(z|x, y)$$

and

$$C(x, y, z) - C(x, z) = C(y|x, z).$$

Therefore, our inequality can be rewritten as

$$C(z|x, y) + C(y|x, z) \le C(y, z),$$

and the right-hand side is (up to $O(\log n)$) equal to $C(z|y) + C(y)$. It
remains to note that $C(z|x, y) \le C(z|y)$ (the more we know, the smaller is
the complexity) and $C(y|x, z) \le C(y)$.

□


15 Incompressible strings

A string $x$ of length $n$ is called incompressible if $C(x|n) \ge n$. A more
liberal definition: $x$ is $c$-incompressible, if $C(x|n) \ge n - c$.

Note that this definition depends on the choice of the optimal decompressor
(but the difference can be covered by an $O(1)$-change in $c$).

Theorem 13. For each $n$ there exist incompressible strings of length $n$. For
each $n$ and each $c$ the fraction of $c$-incompressible strings among all
strings of length $n$ is greater than $1 - 2^{-c}$.

Proof. The number of descriptions of length less than $n - c$ is
$1 + 2 + 4 + \ldots + 2^{n-c-1} < 2^{n-c}$. Therefore, the fraction of
$c$-compressible strings is less than

$$2^{n-c} / 2^n = 2^{-c}.$$

□ 

16 Computability and complexity of initial segments 

Theorem 14. An infinite sequence $x = x_1 x_2 x_3 \ldots$ of zeros and ones is
computable if and only if $C(x_1 \ldots x_n | n) = O(1)$.

Proof. If $x$ is computable, then the initial segment $x_1 \ldots x_n$ is a
computable function of $n$, and $C(f(n)|n) = O(1)$ for every computable
function $f$.

The other direction is more complicated. We provide this proof since it uses
some methods that are typical for the general theory of computation (recursion
theory).

Assume that $C(x_1 \ldots x_n | n) < c$ for some $c$ and all $n$. We have to
prove that the sequence $x_1 x_2 \ldots$ is computable. Let us say that a string
$x$ of length $n$ is “simple” if $C(x|n) < c$. There are at most $2^c$ simple
strings of each length. The set of all simple strings is enumerable (we can
generate them trying all short descriptions in parallel for all $n$).

We call a string “good” if all its prefixes (including the string itself) are
simple. The set of all good strings is also enumerable. (Enumerating simple
strings, we can select strings whose prefixes are found to be simple.)

Good strings form a subtree in the full binary tree. (The full binary tree is
the set of all binary strings. A subset $T$ of the full binary tree is a
subtree if all prefixes of every string $t \in T$ are elements of $T$.)

The sequence $x_1 x_2 \ldots$ is an infinite branch of the subtree of good
strings. Note that this subtree has at most $2^c$ infinite branches because
each level has at most $2^c$ vertices.

Imagine for a while that subtree of good strings is decidable. (In fact, it is 
not the case, and we will need additional construction.) Then we can apply the 
following statement: 

Lemma 1. If a decidable subtree has only a finite number of infinite branches,
all these branches are computable.

Proof. If two branches in a tree are different then they diverge at some point
and never meet again. Consider a level $N$ where all infinite branches have
diverged. It is enough to show that for each branch there is an algorithm that
chooses the direction of the branch (left or right, i.e., 0 or 1) above level
$N$. Since we are above level $N$, the direction is determined uniquely: if we
choose a wrong direction, no infinite branches are possible. By compactness (or
König’s lemma), we know that in this case the subtree rooted in the “wrong”
vertex is finite. This fact can be discovered at some point (recall that the
subtree is assumed to be decidable). Therefore, at each level we can wait until
one of two possible directions is closed, and choose the other one. This
algorithm works only above level $N$, but the initial segment can be a
compiled-in constant. Lemma 1 is proven.
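The algorithm from this proof can be tried on a toy decidable tree. In the sketch below the tree (an assumption made up for the example) has exactly two infinite branches, 0101... and 1111..., and consists of all strings that deviate from one of them by at most two extra bits; the branches have already diverged after the first bit. The names `on_branch`, `in_tree`, `alive` and `follow_branch` are hypothetical.

```python
from itertools import count


def on_branch(p: str) -> bool:
    """True if p is a prefix of 0101... or of 1111..., the two infinite branches."""
    alternating = ("01" * (len(p) // 2 + 1))[: len(p)]
    return p == alternating or p == "1" * len(p)


def in_tree(s: str) -> bool:
    """Toy decidable subtree: strings that leave a branch by at most 2 bits."""
    return any(on_branch(s[: len(s) - k]) for k in range(3) if k <= len(s))


def alive(v: str, depth: int) -> bool:
    """Does v have a descendant in the tree exactly `depth` levels below it?"""
    if not in_tree(v):
        return False
    return depth == 0 or alive(v + "0", depth - 1) or alive(v + "1", depth - 1)


def follow_branch(start: str, length: int) -> str:
    """Extend `start` (a vertex past the divergence level) along its branch."""
    v = start
    while len(v) < length:
        for depth in count(1):  # wait until one of the two directions is closed
            if not alive(v + "0", depth):
                v += "1"
                break
            if not alive(v + "1", depth):
                v += "0"
                break
    return v


print(follow_branch("0", 8))
print(follow_branch("1", 6))
```

Starting from the root (the empty string) the inner loop would never stop, because both directions lead to infinite branches; this is why the proof treats the initial segment up to level $N$ as a compiled-in constant.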

Application of Lemma 1 is made possible by the following statement: 

Lemma 2. Let $G$ be a subtree of good strings. Then there exists a decidable
subtree $G' \subset G$ that contains all infinite branches of $G$.

Proof. For each $n$ let $g(n)$ be the number of good strings of length $n$.
Consider the integer $g = \limsup g(n)$. In other words, there exist infinitely
many $n$ such that $g(n) = g$ but only finitely many $n$ such that $g(n) > g$.
We choose some $N$ such that $g(n) \le g$ for all $n \ge N$ and consider only
levels $N, N + 1, \ldots$

A level $n \ge N$ is called complete if $g(n) = g$. By our assumption there are
infinitely many complete levels. On the other hand, the set of all complete
levels is enumerable. Therefore, we can construct a computable increasing
sequence $n_1 < n_2 < \ldots$ of complete levels. (To find $n_{i+1}$, we
enumerate complete levels until we find $n_{i+1} > n_i$.)

There is an algorithm that for each $i$ finds the list of all good strings of
length $n_i$. (It waits until $g$ good strings of length $n_i$ appear.) Let us
call all those strings (for all $i$) “selected”. The set of all selected strings
is decidable. If a string of length $n_j$ is selected, then its prefix of length
$n_i$ (for $i < j$) is selected. It is easy to see now that selected strings and
their prefixes form a decidable subtree $G'$ that includes all infinite
branches of $G$.

Lemma 2 (and Theorem 14) are proven.

For a computable sequence $x_1 x_2 \ldots$ we have
$C(x_1 \ldots x_n | n) = O(1)$ and therefore
$C(x_1 \ldots x_n) \le \log n + O(1)$. One can prove that this last (seemingly
weaker) inequality also implies computability of the sequence. However, the
inequality $C(x_1 \ldots x_n) = O(\log n)$ does not imply computability of
$x_1 x_2 \ldots$, as the following result shows.

Theorem 15. Let $A$ be an enumerable set of natural numbers. Then for its
characteristic sequence $a_0 a_1 a_2 \ldots$ ($a_i = 1$ if $i \in A$ and $a_i = 0$ otherwise) we have

$$C(a_0 a_1 \ldots a_n) = O(\log n).$$

Proof. To specify $a_0 \ldots a_n$ it is enough to specify two numbers. The first is $n$ and
the second is the number of 1's in $a_0 \ldots a_n$, i.e., the cardinality of the set $A \cap [0, n]$.
Indeed, for a given $n$, we can enumerate this set, and since we know its cardinality,
we know when to stop the enumeration. Each of the two numbers uses $O(\log n)$ bits. □

This theorem shows that initial segments of characteristic sequences of enumerable
sets are far from being incompressible.
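The decoding step in the proof of Theorem 15 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `decode_prefix` and `enumerate_squares` are hypothetical names, and the set of perfect squares is a toy stand-in for an arbitrary enumerable set $A$ (given as a generator that lists its elements in some order).

```python
from itertools import count
from typing import Callable, Iterator, List


def decode_prefix(n: int, ones: int,
                  enumerate_A: Callable[[], Iterator[int]]) -> List[int]:
    """Rebuild a_0 ... a_n from n and the number of elements of A in [0, n]."""
    found = set()
    if ones > 0:
        for element in enumerate_A():
            if element <= n:
                found.add(element)
                # Knowing the cardinality tells us when to stop enumerating.
                if len(found) == ones:
                    break
    return [1 if i in found else 0 for i in range(n + 1)]


def enumerate_squares() -> Iterator[int]:
    """Toy enumeration (an assumption for the example): all perfect squares."""
    for k in count():
        yield k * k


print(decode_prefix(10, 4, enumerate_squares))
```

The decoder receives only the two numbers $n$ and `ones`, each of which takes $O(\log n)$ bits; the enumeration procedure itself is part of the fixed decoding algorithm.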

As we know that for each n there exists an incompressible sequence of length 
n, it is natural to ask whether there is an infinite sequence $x_1 x_2 \ldots$ such that its
initial segment of arbitrary length n is incompressible (or at least c-incompressible 
for some c that does not depend on n). The following theorem shows that it is not 
the case. 

Theorem 16. There exists $c$ such that for every sequence $x_1 x_2 x_3 \ldots$ there are
infinitely many $n$ such that

$$C(x_1 x_2 \ldots x_n) \le n - \log n + c.$$

Proof. The main reason why it is the case is that the series $\sum (1/n)$ diverges. This
makes it possible to select sets $A_1, A_2, \ldots$ with the following properties:

(1) each $A_i$ consists of strings of length $i$;

(2) $|A_i| \le 2^i/i$;

(3) for every infinite sequence $x_1 x_2 \ldots$ there are infinitely many $i$ such that
$x_1 \ldots x_i \in A_i$;

(4) the set $A = \bigcup_i A_i$ is decidable.

Indeed, starting with some $A_i$, we cover about a $(1/i)$-fraction of the entire space
$\Omega$ of all infinite sequences. Then we can choose $A_{i+1}$ to cover another part of $\Omega$, and
so on until we cover all of $\Omega$ (this happens because $1/i + 1/(i+1) + \ldots + 1/j$ goes to
infinity). Then we can start again, providing a second layer of covering, etc.

It is easy to see that $|A_1| + |A_2| + \ldots + |A_i| = O(2^i/i)$: each term is almost
twice as big as the preceding one, therefore the sum is $O(\text{last term})$. Therefore,
if we write down in lexicographic ordering all the elements of $A_1, A_2, \ldots$, every
element $x$ of $A_i$ will have ordinal number $O(2^i/i)$. This number determines $x$
uniquely and therefore for every $x \in A_i$ we have

$$C(x) \le \log(O(2^i)/i) = i - \log i + O(1).$$


□ 


Problems 

1. True or false: for every computable function $f$ there exists a constant $c$ such
that $C(x \mid y) \le C(x \mid f(y)) + c$ for all $x, y$ such that $f(y)$ is defined.

2. Prove that $C(x_1 \ldots x_n \mid n) \le \log n + O(1)$ for every characteristic sequence of
an enumerable set.

3*. Prove that there exists a sequence $x_1 x_2 \ldots$ such that
$C(x_1 \ldots x_n) \ge n - 2 \log n - c$ for some $c$ and for all $n$.

4*. Prove that if $C(x_1 \ldots x_n) \le \log n + c$ for some $c$ and all $n$, then the sequence
$x_1 x_2 \ldots$ is computable.


17 Incompressibility and lower bounds 

In this section we show how to apply Kolmogorov complexity to obtain a lower
bound for the following problem. Let $M$ be a Turing machine (with one tape) that
duplicates its input: for every string $x$ on the tape (with blanks on the right of $x$)
it produces $xx$. We prove that $M$ requires time $\Omega(n^2)$ if $x$ is an incompressible
string of length $n$. The idea is simple: the head of the TM can carry a finite number of
bits with limited speed, therefore the speed of information transfer (measured in
bit×cell/step) is bounded and to move $n$ bits by $n$ cells we need $\Omega(n^2)$ steps.

Theorem 17. Let $M$ be a Turing machine. Then there exists some constant $c$ with
the following property: for every $k$, every $l \ge k$ and every $t$, if cells $c_i$ with $i > k$
are initially empty, then the complexity of the string $c_{l+1} c_{l+2} \ldots$ after $t$ steps is
bounded by $ct/(l - k) + O(\log l + \log t)$.

Roughly speaking, if we have to move information at least by $l - k$ cells, then
we can bring at most $ct/(l - k)$ bits into the area where there was no information
at the beginning.

One technical detail: the string $c_{l+1} c_{l+2} \ldots$ denotes the visited part of the tape
(and is finite).

This theorem can be used to get a lower bound for duplication. Let $x$ be an
incompressible string of length $n$. We apply the duplicating machine to the string $0^n x$
(with $n$ zeros before $x$). After the machine terminates in $t$ steps, the tape is $0^n x 0^n x$.
Let $k = 2n$ and $l = 3n$. We can apply our theorem and get
$n \le C(x) \le ct/n + O(\log n + \log t)$.
Therefore, $t = \Omega(n^2)$ (note that $\log t < 2 \log n$ unless $t > n^2$).

Proof. Let $u$ be an arbitrary point on the tape between $k$ and $l$. A customs officer
records what the TM carries in its head while crossing point $u$ from left to right (but
not the time of crossing). The recorded sequence $T_u$ of TM-states is called the trace
(at point $u$). Each state occupies $O(1)$ bits since the set of states is finite. This
trace together with $u$, $k$, $l$ and the number of steps after the last crossing (at most
$t$) is enough to reconstruct the contents of $c_{l+1} c_{l+2} \ldots$ at the moment $t$. (Indeed,
we can simulate the behavior of $M$ on the right of $u$.) Therefore,
$C(c_{l+1} c_{l+2} \ldots) \le cN_u + O(\log l) + O(\log t)$,
where $N_u$ is the length of $T_u$, i.e., the number of crossings at $u$.

Now we add these inequalities for all $u = k, k+1, \ldots, l$. The sum of $N_u$ is
bounded by $t$ (since only one crossing is possible at a given time). So

$$(l - k)\, C(c_{l+1} c_{l+2} \ldots) \le ct + (l - k)\,[O(\log l) + O(\log t)]$$

and our theorem is proven.

□ 

The original result (one of the first lower bounds for time complexity) was not 
for duplication but for palindrome recognition: every TM that checks whether its 
input is a palindrome (like abadaba) uses $\Omega(n^2)$ steps for some inputs of length
n. This statement can also be proven by the incompressibility method. 

Proof sketch: Consider a palindrome $x x^R$ of length $2n$. Let $u$ be an arbitrary
position in the first half of $x x^R$: $x = yz$ and the length of $y$ is $u$. Then the trace $T_u$
determines $y$ uniquely if we record states of the TM while crossing checkpoint $u$ in
both directions. Indeed, if strings with different $y$ have the same trace, we can
mix the left part of one computation with the right part of another one and get a
contradiction. Taking all $u$ between $|x|/4$ and $|x|/2$, we get the required bound.


18 Incompressibility and prime numbers

Let us prove that there are infinitely many prime numbers. Imagine that there are
only $n$ prime numbers $p_1, \ldots, p_n$. Then each integer $N$ can be factored as

$$N = p_1^{k_1} p_2^{k_2} \cdots p_n^{k_n},$$

where all $k_i$ do not exceed $\log N$. Therefore, each $N$ can be described by $n$ integers
$k_1, \ldots, k_n$, and $k_i \le \log N$ for every $i$, so the total number of bits needed to describe
$N$ is $O(n \log\log N)$. But $N$ corresponds to a string of length $\log N$, so we get a
contradiction if this string is incompressible.
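The bit counting behind this argument can be made concrete with a small Python sketch. It is only an illustration: the helper names `exponent_vector` and `description_bits` are hypothetical, and the fixed-width encoding of the exponents is one simple choice among many.

```python
def exponent_vector(N, primes):
    """Exponents k_i with N = prod p_i**k_i, or None if another prime divides N."""
    exponents = []
    for p in primes:
        k = 0
        while N % p == 0:
            N //= p
            k += 1
        exponents.append(k)
    return exponents if N == 1 else None


def description_bits(N, primes):
    """Bits used when each exponent is written in a field wide enough for log2(N)."""
    max_exponent = N.bit_length() - 1          # every k_i is at most floor(log2 N)
    width = max(1, max_exponent.bit_length())  # about log log N bits per exponent
    return len(primes) * width


N = 2**100 * 3**50
print(exponent_vector(N, [2, 3, 5]))
print(description_bits(N, [2, 3, 5]), "bits versus", N.bit_length(), "bits")
```

If the list of primes were finite and complete, every $N$ would admit such a short description, which is impossible for an $N$ whose binary representation is incompressible.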


19 Incompressible matrices 

Consider an incompressible Boolean matrix of size $n \times n$. Let us prove that its
rank (over the field $\mathbb{F}_2 = \{0, 1\}$) is greater than $n/2$.

Indeed, imagine that its rank is at most $n/2$. Then we can select $n/2$ columns
of the matrix such that all other columns are linear combinations of the selected
ones. Let $k_1, \ldots, k_{n/2}$ be the numbers of these columns.

Then, instead of specifying all bits of the matrix we can specify:

(1) the numbers $k_1, \ldots, k_{n/2}$ ($O(n \log n)$ bits);

(2) bits in the selected columns ($n^2/2$ bits);

(3) $n^2/4$ bits that are coefficients in linear combinations of selected columns
needed to get non-selected columns ($n/2$ bits for each of $n/2$ non-selected columns).

Therefore, we get $0.75 n^2 + O(n \log n)$ bits instead of the $n^2$ needed for an
incompressible matrix.
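As a sanity check of this accounting, the following Python sketch adds up the three parts of the description for a concrete even $n$. The function name `low_rank_description_bits` is hypothetical, and writing each column number with $\lceil \log_2 n \rceil$ bits is an assumption made for the example.

```python
import math


def low_rank_description_bits(n):
    """Total bits in the three-part description of an n x n matrix of rank <= n/2."""
    half = n // 2
    index_bits = half * math.ceil(math.log2(n))  # (1) numbers of the selected columns
    column_bits = half * n                       # (2) bits in the selected columns
    coefficient_bits = half * half               # (3) n/2 coefficients per other column
    return index_bits + column_bits + coefficient_bits


for n in (8, 64, 1024):
    print(n, low_rank_description_bits(n), n * n)
```

Already for moderate $n$ the total is well below $n^2$, and the gap grows like $n^2/4$, which is what contradicts incompressibility (up to the additive constant in the definition).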

Of course, it is trivial to find an $n \times n$ Boolean matrix of full rank, but this
construction is interesting as an illustration of the incompressibility technique. 


20 Incompressible graphs 

An undirected graph with $n$ vertices can be represented by a bit string of length
$n(n-1)/2$ (its adjacency matrix is symmetric). We call a graph incompressible if
this string is incompressible.

Let us show that an incompressible graph is necessarily connected. Indeed,
imagine that it can be divided into two connected components, and one of them
(the smaller one) has $k$ vertices ($k < n/2$). Then the graph can be described by

(1) the list of numbers of $k$ vertices in this component ($k \log n$ bits), and

(2) $k(k-1)/2$ and $(n-k)(n-k-1)/2$ bits needed to describe both components.

In (2) (compared to the full description of the graph) we save $k(n-k)$ bits for
edges that go from one component to another one, and $k(n-k) > O(k \log n)$ for
big enough $n$ (recall that $k < n/2$).


21 Incompressible tournaments 

Let $M$ be a tournament, i.e., a complete directed graph with $n$ vertices (for every
two different vertices $i$ and $j$ there exists either edge $i \to j$ or $j \to i$ but not both).

A tournament is transitive if its vertices are linearly ordered by the relation
$i \to j$.

Lemma. Each tournament of size $2^k - 1$ has a transitive sub-tournament of
size $k$.

Proof. (Induction on $k$.) Let $x$ be a vertex. Then the $2^k - 2$ remaining vertices are
divided into two groups: “smaller” than $x$ and “greater” than $x$. At least one of
the groups has at least $2^{k-1} - 1$ elements and contains a transitive sub-tournament of size
$k - 1$. Adding $x$ to it, we get a transitive sub-tournament of size $k$. □
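The induction in this proof is effectively an algorithm. A minimal Python sketch follows, assuming the tournament is given as a dictionary `beats[(i, j)]` that is true when there is an edge from `i` to `j`; the names `random_tournament`, `transitive_subtournament` and `is_transitive` are hypothetical helpers introduced only for the illustration.

```python
import random
from itertools import combinations


def random_tournament(n, seed=0):
    """Random orientation of every pair; beats[(i, j)] is True iff i -> j."""
    rng = random.Random(seed)
    beats = {}
    for i, j in combinations(range(n), 2):
        forward = rng.random() < 0.5
        beats[(i, j)] = forward
        beats[(j, i)] = not forward
    return beats


def transitive_subtournament(vertices, beats):
    """Pick a vertex, keep the larger of its two groups, and repeat."""
    chain = []
    pool = list(vertices)
    while pool:
        x, rest = pool[0], pool[1:]
        smaller = [v for v in rest if beats[(v, x)]]
        greater = [v for v in rest if beats[(x, v)]]
        chain.append(x)
        pool = smaller if len(smaller) >= len(greater) else greater
    return chain


def is_transitive(vertices, beats):
    """A tournament is transitive iff it contains no directed 3-cycle."""
    for a, b, c in combinations(vertices, 3):
        if beats[(a, b)] and beats[(b, c)] and beats[(c, a)]:
            return False
        if beats[(b, a)] and beats[(c, b)] and beats[(a, c)]:
            return False
    return True


beats = random_tournament(15)
chain = transitive_subtournament(range(15), beats)
print(chain, is_transitive(chain, beats))
```

Each chosen vertex is either beaten by all later chosen vertices or beats all of them, so no directed 3-cycle can appear among the chosen vertices; with $2^k - 1$ starting vertices the loop runs at least $k$ times.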

This lemma gives a lower bound on the size of a graph that does not include a
transitive $k$-tournament.

The incompressibility method provides an upper bound: an incompressible
tournament with $n$ vertices may have transitive sub-tournaments of $O(\log n)$ size
only.

A tournament with $n$ vertices is represented by $n(n-1)/2$ bits. If a tournament
$R$ with $n$ vertices has a transitive sub-tournament $R'$ of size $k$, then $R$ can be
described by:

(1) the numbers of vertices in $R'$ listed according to the linear $R'$-ordering ($k \log n$
bits), and

(2) the remaining bits in the description of $R$ (except for bits that describe relations
inside $R'$).

In (2) we save $k(k-1)/2$ bits, and in (1) we use $k \log n$ additional bits. Since
we have to lose more than we win, $k = O(\log n)$.


22 Discussion


All these results can be considered as direct reformulations of counting (or probabilistic)
arguments. Moreover, counting gives us better bounds without $O$-notation.

But complexity arguments provide an important heuristic: we want to prove
that a random object $x$ has some property and note that if $x$ does not have it, then $x$
has some regularities that can be used to give a short description for $x$.

Problems 

1. Let $x$ be an incompressible string of length $n$ and let $y$ be a longest substring
of $x$ that contains only zeros. Prove that $|y| = O(\log n)$.

2*. Prove that $|y| = \Omega(\log n)$.

3. Let $w(n)$ be the largest integer such that for each tournament $T$ on
$N = \{1, \ldots, n\}$ there exist disjoint sets $A$ and $B$, each of cardinality $w(n)$, such that
$A \times B \subset T$. Prove that $w(n) \le 2\lceil \log n \rceil$. (Hint: add $2w(n)\lceil \log n \rceil$ bits to describe
nodes, and save $w(n)^2$ bits on edges. See 0 and 01.)


23 k- and (k + 1)-head automata

A $k$-head finite automaton has $k$ (numbered) heads that scan from left to right the
input string (which is the same for all heads). The automaton has a finite number of
states. The transition table specifies an action for each state and each $k$-tuple of input
symbols. An action is a pair: the new state, and the subset of heads to be moved. (We
may assume that at least one head should be moved; otherwise we can precompute
the next transition. We assume also that the input string is followed by blank
symbols, so the automaton knows which heads have seen the entire input string.)

One of the states is called an initial state. Some states are accepting states. 
An automaton A accepts string x if A comes to an accepting state after reading 
x, starting from the initial state and all heads placed at the left-most character. 
Reading x is finished when all heads leave x. We require that this happens for 
arbitrary string x. 

For $k = 1$ we get the standard notion of finite automaton.

Example: A 2-head automaton can recognize strings of the form x#x (where x is
a binary string). The first head moves to #-symbol and then both heads move and 
check whether they see the same symbols. 

It is well known that this language cannot be recognized by a 1-head finite automaton,
so 2-head automata are more powerful than 1-head ones.
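The 2-head recognizer for strings of the form x#x can be simulated directly. The Python sketch below is an illustration only: the function name `accepts_x_hash_x` is hypothetical, the two heads are modelled as indices that never decrease, and the control logic uses a constant amount of state besides the head positions.

```python
def accepts_x_hash_x(s: str) -> bool:
    """Simulate two left-to-right heads deciding whether s has the form x#x."""
    n = len(s)
    lead = 0
    # Phase 1: the first head runs over the binary string x up to the separator.
    while lead < n and s[lead] in "01":
        lead += 1
    if lead == n or s[lead] != "#":
        return False
    lead += 1  # step over '#'
    trail = 0
    # Phase 2: both heads advance in lockstep and compare the symbols they see.
    while lead < n and s[trail] != "#":
        if s[lead] != s[trail]:
            return False
        lead += 1
        trail += 1
    # Accept only if both copies end at the same time.
    return lead == n and s[trail] == "#"


print(accepts_x_hash_x("0110#0110"), accepts_x_hash_x("0110#0111"))
```

The second head lags behind by exactly the length of x plus one, which is the information a single head with finitely many states cannot keep.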

Our goal is to prove the same separation between $k$-head automata and
$(k+1)$-head automata for arbitrary $k$.

Theorem 18. For every $k \ge 1$ there exists a language that can be recognized by a
$(k+1)$-head automaton but not by a $k$-head one.

Proof. The language is similar to the language considered above. For example, 
for $k = 2$ we consider a language consisting of strings

x#y#z#z#y#x 

Using three heads, we can easily recognize this language. Indeed, the first head 
moves from left to right and ignores the left part of the input string, while the 
second and the third one are moved to the left copies of x and y. These copies are 
checked when the first head crosses the right copies of y and x. Then only one 
unchecked string z remains, and there are two heads at the left of it, so this can be 
done. 

The same approach shows that an automaton with $k$ heads can recognize the
language $L_N$ that consists of strings

$$x_1 \# x_2 \# \ldots \# x_N \# x_N \# \ldots \# x_2 \# x_1$$

for $N = (k-1) + (k-2) + \ldots + 1 = k(k-1)/2$ (and for all smaller $N$).

Let us prove now that a $k$-head automaton $A$ cannot recognize $L_N$ if $N$ is bigger
than $k(k-1)/2$. (In particular, no automaton with 2 heads can recognize $L_3$ and
even $L_2$.)

Let us fix a string

$$x = x_1 \# x_2 \# \ldots \# x_N \# x_N \# \ldots \# x_2 \# x_1$$

where all $x_i$ have the same length $l$ and the string $x_1 x_2 \ldots x_N$ is an incompressible
string (of length $Nl$). The string $x$ is accepted by $A$. In our argument the following
notion is crucial: we say that an (unordered) pair of heads “covers” $x_m$ if at some
point one head is inside the left copy of $x_m$ while the other head (from this pair) is
inside the right copy.

After that the right head can visit only strings $x_{m-1}, \ldots, x_1$ and the left head cannot
visit the left counterparts of those strings (they are on the left of it). Therefore,
only one $x_m$ can be covered by a given pair of heads.

In our example we had three heads (and, therefore, three pairs of heads) and
each string $x_1, x_2, x_3$ was covered by one pair.

The number of pairs is $k(k-1)/2$ for $k$ heads. Therefore (since $N > k(k-1)/2$)
there exists some $x_m$ that was not covered at all during the computation.
We show that the conditional complexity of $x_m$ when all other $x_i$ are known does not
exceed $O(\log l)$. (The constant here depends on $N$ and $A$, but not on $l$.) This contradicts
the incompressibility of $x_1 \ldots x_N$ (we can replace $x_m$ by a self-delimiting
description of $x_m$ when the other $x_i$ are known and get a shorter description of an
incompressible string).

The bound for the conditional complexity of $x_m$ can be obtained in the following
way. During the accepting computation we take special care of the periods
when one of the heads is inside $x_m$ (on the left or on the right). We call these periods
“critical sections”. Note that each critical section is either L-critical (some
heads are inside the left copy of $x_m$) or R-critical but not both (no pair of heads
covers $x_m$). A critical section starts when one of the heads moves inside $x_m$ (other
heads can also move in during the section) and ends when all heads leave $x_m$.
Therefore, the number of critical sections during the computation is at most $2k$.

Let us record the positions of all heads and the state of the automaton at the beginning
and at the end of each critical section. This requires $O(\log l)$ bits (note
that we do not record time).

We claim that this information (called the trace in the sequel) determines $x_m$ if all
other $x_i$ are known. To see why, let us consider two computations with different
$x_m$ and $x'_m$ but the same $x_i$ for $i \ne m$ and the same traces.

Equal traces allow us to “cut and paste” these two computations on the boundaries
of critical sections. (Outside the critical sections the computations are the same,
because the strings are identical except for $x_m$, and the state and positions after each
critical section are included in a trace.) Now we take L-critical sections from one
computation and R-critical sections from another one. We get a mixed computation
that is an accepting run of $A$ on a string that has $x_m$ on the left and $x'_m$ on the
right. Therefore, $A$ accepts a string that it should not accept. □


24 Heap sort: time analysis 

Let us assume that we sort numbers $1, 2, \ldots, N$. We have $N!$ possible permutations.
Therefore, to specify a permutation we need about $\log(N!)$ bits. Stirling's
formula says that $N! \approx (N/e)^N$, therefore the number of bits needed to specify
one permutation is $N \log N + O(N)$. As usual, most of the permutations are incompressible
in the sense that they have complexity at least $N \log N - O(N)$. We
estimate the number of operations for heap sort in the case of an incompressible
permutation.

Heap sort (we assume in this section that the reader knows what it is) consists
of two phases. The first phase creates a heap out of the input array. (The indexes in
the array $a[1..N]$ form a tree where $2i$ and $2i+1$ are sons of $i$. The heap property says
that an ancestor has a bigger value than its descendants.)

Transforming the array into a heap goes as follows: for each $i = N, N-1, \ldots, 1$
we make a heap out of the subtree rooted at $i$ assuming that the $j$-subtrees for $j > i$ are
heaps. Doing this for the node $i$, we need $O(k)$ steps where $k$ is the distance
between node $i$ and the leaves of the tree. Here $k = 0$ for about half of the nodes,
$k = 1$ for about 1/4 of the nodes etc., and the average number of steps per node is
$O(\sum k 2^{-k}) = O(1)$; the total number of operations is $O(N)$.

Important observation: after the heap is created, the complexity of the array $a[1..N]$
is still $N \log N + O(N)$, if the initial permutation was incompressible. Indeed,
“heapifying” means composing the initial permutation with some other permutation
(which is determined by results of comparisons between array elements).
Since the total time for heapifying is $O(N)$, there are at most $O(N)$ comparisons
and their results form a bit string of length $O(N)$ that determines the heapifying
permutation. The initial (incompressible) permutation is a composition of
the heap and an $O(N)$-permutation, therefore the heap has complexity at least
$N \log N - O(N)$.

The second phase transforms the heap into a sorted array. At every stage
the array is divided into two parts: $a[1..n]$ is still a heap, but $a[n+1..N]$ is the
end of the sorted array. One step of transformation (it decreases $n$ by 1) goes as
follows: the maximal heap element $a[1]$ is taken out of the heap and exchanged
with $a[n]$. Therefore, $a[n..N]$ is now sorted, and the heap property is almost true:
an ancestor has a bigger value than its descendant unless the ancestor is $a[n]$ (that is now
in root position). To restore the heap property, we move $a[n]$ down the heap. The
question is how many steps we need. If the final position is $d_n$ levels above the
leaves level, we need $\log N - d_n$ exchanges, and the total number of exchanges is
$N \log N - \sum d_n$.

We claim that $\sum d_n = O(N)$ for incompressible permutations, and, therefore,
the total number of exchanges is $N \log N + O(N)$.

So why is $\sum d_n$ equal to $O(N)$? Let us record the direction of movements while elements
fall down through the heap (using 0 and 1 for left and right). We don't use
delimiters to separate strings that correspond to different $n$ and use $N \log N - \sum d_i$
bits altogether. Separately we write down all $d_n$ in a self-delimiting way. This requires
$\sum (2 \log d_i + O(1))$ bits. All this information allows us to reconstruct the
exchanges during the second phase, and therefore to reconstruct the initial state
of the heap before the second phase. Therefore, the complexity of the heap before
the second phase (which is $N \log N - O(N)$) does not exceed
$N \log N - \sum d_n + \sum (2 \log d_n) + O(N)$,
therefore $\sum (d_n - 2 \log d_n) = O(N)$. Since $2 \log d_n < 0.5 d_n$
for $d_n > 16$ (and all smaller $d_n$ have sum $O(N)$ anyway), we conclude that
$\sum d_n = O(N)$.
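The quantity analysed above, the number of exchanges in the second phase, is easy to measure experimentally. The following Python sketch is an illustration under stated assumptions: `heap_sort_exchanges` is a hypothetical name, the heap uses the 1-based layout described in the text, and a pseudo-random permutation is used as a stand-in for an incompressible one.

```python
import math
import random


def heap_sort_exchanges(a):
    """Sort the list a in place; return the number of exchanges made while
    moving elements down the heap during the second phase."""
    N = len(a)
    h = [None] + a  # 1-based layout: sons of i are 2i and 2i + 1

    def sift_down(i, n):
        swaps = 0
        while 2 * i <= n:
            child = 2 * i
            if child + 1 <= n and h[child + 1] > h[child]:
                child += 1  # pick the larger son
            if h[i] >= h[child]:
                break
            h[i], h[child] = h[child], h[i]
            i = child
            swaps += 1
        return swaps

    # Phase 1: build the heap bottom-up.
    for i in range(N, 0, -1):
        sift_down(i, N)

    # Phase 2: repeatedly move the maximum to the end and restore the heap.
    exchanges = 0
    for n in range(N, 1, -1):
        h[1], h[n] = h[n], h[1]
        exchanges += sift_down(1, n - 1)

    a[:] = h[1:]
    return exchanges


N = 1 << 12
perm = list(range(1, N + 1))
random.Random(1).shuffle(perm)
print(heap_sort_exchanges(perm), N * math.log2(N))
```

For typical permutations the printed count should be close to $N \log N$, in line with the claim that the deficit $\sum d_n$ is only $O(N)$; the sketch does not prove this, it only lets one observe it.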

Problems 

1*. Prove that for most pairs of binary strings x,y of length n every common 
subsequence of x and y has length at most 0.99n (for large enough n). 


25 Infinite random sequences 

There is an intuitive feeling that fair coin tossing cannot produce the
sequence

00000000000000000000000 ...

or

01010101010101010101010 ...

so infinite sequences of zeros and ones can be divided into two categories. Random
sequences are sequences that are plausible outcomes of coin tossing; non-random
sequences (including the two sequences above) are not plausible. It is
more difficult to provide an example of a random sequence (it somehow becomes
non-random after the example is provided), so our intuition is not very reliable
here.


26 Classical probability theory 

Let $\Omega$ be the set of all infinite sequences of zeros and ones. We define the uniform
Bernoulli measure on $\Omega$ as follows. For each binary string $x$ let $\Omega_x$ be the set of
all sequences that have prefix $x$ (a subtree rooted at $x$).

Consider a measure $P$ such that $P(\Omega_x) = 2^{-|x|}$. Measure theory allows us to
extend this measure to all Borel sets (and even further).

A set $X \subset \Omega$ is called a null set if $P(X)$ is defined and $P(X) = 0$. Let us give a
direct equivalent definition that is useful for the constructive version:

A set $X \subset \Omega$ is a null set if for every $\varepsilon > 0$ there exists a sequence of binary
strings $x_0, x_1, \ldots$ such that

(1) $X \subset \Omega_{x_0} \cup \Omega_{x_1} \cup \ldots$;

(2) $\sum_i 2^{-|x_i|} < \varepsilon$.

Note that $2^{-|x_i|}$ is $P(\Omega_{x_i})$ according to our definition. In words: $X$ is a null
set if it can be covered by a sequence of intervals $\Omega_{x_i}$ of arbitrarily small total
measure.

Examples: Each singleton is a null set. A countable union of null sets is a null
set. A subset of a null set is a null set. The set $\Omega$ is not a null set (by compactness).
The set of all sequences that have zeros at positions with even numbers is a null
set.
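For the last example, a covering of small total measure can be written out explicitly. The Python sketch below is only an illustration; the names `cover_even_zeros` and `total_measure` are hypothetical. It uses all strings of length $2m$ that have zeros at positions $0, 2, \ldots, 2m-2$: there are $2^m$ of them, each of measure $2^{-2m}$, so the total measure is $2^{-m}$.

```python
from fractions import Fraction
from itertools import product


def cover_even_zeros(eps):
    """Strings whose intervals cover every sequence with zeros at even positions,
    with total measure below eps."""
    m = 1
    while Fraction(1, 2**m) >= eps:
        m += 1
    # Even positions are forced to 0, odd positions range over all m-bit patterns.
    return ["".join("0" + bit for bit in odd_bits)
            for odd_bits in product("01", repeat=m)]


def total_measure(strings):
    """Sum of 2^(-|x|) over the given strings."""
    return sum(Fraction(1, 2**len(x)) for x in strings)


cover = cover_even_zeros(Fraction(1, 10))
print(len(cover), total_measure(cover))
```

Since such a finite covering can be computed from $\varepsilon$, this example also fits the constructive notion of a null set.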


27 Strong Law of Large Numbers 

Informally, the strong law of large numbers (SLLN) says that random sequences
$x_0 x_1 \ldots$ have limit frequency 1/2, i.e.,

$$\lim_{n \to \infty} \frac{x_0 + x_1 + \ldots + x_{n-1}}{n} = \frac{1}{2}.$$

However, the word “random” here is used only as a shortcut: the full meaning is
that the set of all sequences that do not satisfy SLLN (do not have limit frequency
or have it different from 1/2) is a null set.

In general, when people say that “$P(\omega)$ is true for random $\omega \in \Omega$”, it usually
means that the set

$$\{\omega \mid P(\omega) \text{ is false}\}$$

is a null set.

Proof sketch for SLLN: it is enough to show that for every $\delta > 0$ the set $N_\delta$ of
sequences that have frequency greater than $1/2 + \delta$ for infinitely many prefixes
has measure 0. (After that we use that a countable union of null sets is a null set.)
For each $n$ consider the probability $p(n, \delta)$ of the event “a random string of length
$n$ has more than $(1/2 + \delta)n$ ones”. The crucial observation is that

$$\sum_n p(n, \delta) < \infty$$

for each $\delta > 0$. (Actually, $p(n, \delta)$ is exponentially decreasing as $n \to \infty$; the proof
uses Stirling's approximation for factorials.) If the series above has a finite sum,
for every $\varepsilon > 0$ one can find an integer $N$ such that

$$\sum_{n > N} p(n, \delta) < \varepsilon.$$

Consider all strings $z$ of length greater than $N$ that have frequency of ones greater
than $1/2 + \delta$. The sum of $P(\Omega_z)$ is equal to $\sum_{n > N} p(n, \delta) < \varepsilon$, and $N_\delta$ is covered
by the family $\Omega_z$.
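The exponential decay of $p(n, \delta)$ can be checked numerically. The Python sketch below computes the probability exactly from binomial coefficients; the function name `tail_probability` is hypothetical, and the comparison with $e^{-2\delta^2 n}$ uses Hoeffding's inequality, a standard bound that is not part of the argument in the text (which relies on Stirling's approximation instead).

```python
import math


def tail_probability(n, delta):
    """p(n, delta): probability that a uniformly random n-bit string
    has more than (1/2 + delta) * n ones."""
    threshold = (0.5 + delta) * n
    favourable = sum(math.comb(n, k) for k in range(n + 1) if k > threshold)
    return favourable / 2**n


for n in (10, 50, 200, 1000):
    print(n, tail_probability(n, 0.1), math.exp(-2 * 0.1**2 * n))
```

A geometric bound of this kind makes the series $\sum_n p(n, \delta)$ converge, which is all the proof sketch needs.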


28 Effectively null sets 

The following notion was introduced by Per Martin-Löf. A set $X \subset \Omega$ is an effectively
null set if there is an algorithm that gets a rational number $\varepsilon > 0$ as input
and enumerates a set of strings $\{x_0, x_1, x_2, \ldots\}$ such that

(1) $X \subset \Omega_{x_0} \cup \Omega_{x_1} \cup \Omega_{x_2} \cup \ldots$;

(2) $\sum_i 2^{-|x_i|} < \varepsilon$.

The notion of effectively null set remains the same if we allow only $\varepsilon$ of the form
$1/2^k$, or if we replace “<” by “≤” in (2).

Every subset of an effectively null set is also an effectively null set (evident 
observation). 

For a computable infinite sequence $\omega$ of zeros and ones the singleton $\{\omega\}$ is
an effectively null set. (The same happens for all non-random $\omega$, see below.)

A union of two effectively null sets is an effectively null set. (Indeed, we can
find enumerable coverings of size $\varepsilon/2$ for both and combine them.)

A more general statement requires a preliminary definition. By a “covering algorithm”
for an effectively null set we mean an algorithm mentioned in the definition
(that gets $\varepsilon$ and generates a covering sequence of strings with sum of measures less
than $\varepsilon$).

Lemma. Let $X_0, X_1, X_2, \ldots$ be a sequence of effectively null sets such that there
exists an algorithm that given an integer $i$ produces (some) covering algorithm for
$X_i$. Then $\bigcup X_i$ is an effectively null set.

Proof. To get an $\varepsilon$-covering for $\bigcup X_i$, we put together an $(\varepsilon/2)$-covering for $X_0$,
an $(\varepsilon/4)$-covering for $X_1$, etc. To generate this combined covering, we use the algorithm
that produces a covering for $X_i$ from $i$. □
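The construction in this proof can be sketched with Python generators playing the role of covering algorithms. All names here (`singleton_cover`, `union_cover`) are hypothetical, and the sketch makes a simplifying assumption: each call to `next` on a component generator returns, so running one step of each generator in turn is enough. A faithful version would have to interleave the computations step by step, since a real enumeration may run forever without producing the next string.

```python
from fractions import Fraction
from itertools import count, islice


def singleton_cover(sequence_bit):
    """Covering algorithm for {omega}, where sequence_bit(n) computes bit n of omega."""
    def cover(eps):
        k = 1
        while Fraction(1, 2**k) >= eps:
            k += 1
        # One prefix of length k has measure 2^(-k) < eps and contains omega.
        yield "".join(str(sequence_bit(n)) for n in range(k))
    return cover


def union_cover(cover_for):
    """cover_for(i) returns a covering algorithm for X_i;
    the result is a covering algorithm for the union of all X_i."""
    def cover(eps):
        active = []
        for i in count():
            # X_i gets the budget eps / 2^(i+1), so all budgets sum to eps.
            active.append(cover_for(i)(eps / 2**(i + 1)))
            for generator in list(active):
                try:
                    yield next(generator)
                except StopIteration:
                    active.remove(generator)
    return cover


# Toy family (an assumption for the example): omega_i has ones at multiples of i + 2.
family = union_cover(
    lambda i: singleton_cover(lambda n, i=i: 1 if n % (i + 2) == 0 else 0))
print(list(islice(family(Fraction(1, 4)), 5)))
```

Note that the input to `union_cover` is the algorithm `cover_for` that produces covering algorithms from $i$, exactly the extra hypothesis of the lemma.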


29 Maximal effectively null set

Up to now the theory of effectively null sets just repeats the classical theory of null
sets. The crucial difference is in the following theorem (proved by Martin-Löf):

Theorem 19. There exists a maximal effectively null set, i.e., an effectively null
set $N$ such that $X \subset N$ for every effectively null set $X$.

(Trivial) reformulation: the union of all effectively null sets is an effectively 
null set. 

Proof. We cannot prove this theorem by applying the above lemma to all effectively
null sets (there are uncountably many of them, since every subset of an
effectively null set is an effectively null set).

But we don't need to consider all effectively null sets; it is enough to consider
all covering algorithms. For a given algorithm (that gets a positive rational number
as input and generates binary strings) we cannot say (effectively) whether it is
a covering algorithm or not. But we may artificially enforce some restrictions:
if the algorithm (for given $\varepsilon > 0$) generates strings $x_0, x_1, \ldots$, we can check whether
$2^{-|x_0|} + \ldots + 2^{-|x_k|} < \varepsilon$ or not; if not, we delete $x_k$ from the generated sequence.

Let us denote by $A'$ the modified algorithm (if $A$ was the original one). It is easy to
see that

(1) if $A$ was a covering algorithm for some effectively null set, then $A'$ is
equivalent to $A$ (the condition that we enforce is never violated).

(2) For every $A$ the algorithm $A'$ is (almost) a covering algorithm for some null
set; the only difference is that the infinite sum $\sum 2^{-|x_i|}$ can be equal to $\varepsilon$ even if all
finite sums are strictly less than $\varepsilon$.
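The passage from $A$ to $A'$ is a simple filter. In the Python sketch below, an algorithm is modelled as a function from $\varepsilon$ to a generator of strings; the name `trimmed` is hypothetical, and the running sum is taken over the strings that were kept, which is one natural reading of the deletion rule.

```python
from fractions import Fraction


def trimmed(algorithm):
    """Turn an arbitrary string-enumerating algorithm A into A',
    whose finite sums of measures always stay strictly below eps."""
    def modified(eps):
        total = Fraction(0)
        for x in algorithm(eps):
            weight = Fraction(1, 2**len(x))
            if total + weight < eps:
                total += weight
                yield x
            # Otherwise x would violate the bound and is dropped.
    return modified


def misbehaving(eps):
    """Toy algorithm (an assumption for the example) that ignores its budget."""
    return iter(["0", "10", "110", "1110", "00"])


print(list(trimmed(misbehaving)(Fraction(1, 2))))
```

An algorithm that already respects the bound passes through unchanged, while a misbehaving one is cut down to something that covers a (possibly smaller) set within the allowed measure.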

But this is not important: we can apply the same arguments (that were used
to prove the Lemma) to all algorithms $A'_0, A'_1, \ldots$ where $A_0, A_1, \ldots$ is a sequence of
all algorithms (that get positive rational numbers as inputs and enumerate sets of
binary strings).

□ 

Definition. A sequence $\omega$ of zeros and ones is called (Martin-Löf) random
with respect to the uniform Bernoulli measure if $\omega$ does not belong to the maximal
effectively null set.

(Reformulation: “... if $\omega$ does not belong to any effectively null set.”)

Therefore, to prove that some sequence is non-random we need to show that it
belongs to some effectively null set.

Note also that a set $X$ is an effectively null set if and only if all elements of $X$
are non-random.

This sounds like a paradox for people familiar with classical measure theory.
Indeed, we know that measure somehow reflects the “size” of a set. Each point is a
null set, but if we have too many points, we get a non-null set. Here (in Martin-Löf
theory) the situation is different: if each element of some set forms an effectively
null singleton (i.e., is non-random), then the entire set is an effectively null one.

Problems 

1. Prove that if a sequence $x_0 x_1 x_2 \ldots$ of zeros and ones is (Martin-Löf) random
with respect to the uniform Bernoulli measure, then the sequence $000 x_1 x_2 \ldots$ is also
random. Moreover, adding an arbitrary finite prefix to a random sequence, we get
a random sequence, and adding an arbitrary finite prefix to a non-random sequence,
we get a non-random sequence.

2. Prove that every (finite) binary string appears infinitely many times in every 
random sequence. 

3. Prove that every computable sequence is non-random. Give an example of 
a non-computable non-random sequence. 

4. Prove that the set of all computable infinite sequences of zeros and ones is 
an effectively null set. 

5*. Prove that if a sequence $x_0 x_1 \ldots$ is not random, then $n - C(x_0 \ldots x_{n-1} \mid n)$
tends to infinity as $n \to \infty$.


30 Gambling and selection rules 

Richard von Mises suggested (around 1910) the following notion of a random
sequence (he uses the German word Kollektiv) as a basis for probability theory. A
sequence $x_0 x_1 x_2 \ldots$ is called (Mises) random, if

(1) it satisfies the strong law of large numbers, i.e., the limit frequency of 1's
in it is 1/2:

$$\lim_{n \to \infty} \frac{x_0 + x_1 + \ldots + x_{n-1}}{n} = \frac{1}{2};$$

(2) the same is true for every infinite subsequence selected by an “admissible
selection rule”.

Examples of admissible selection rules: (a) select terms with even indices;
(b) select terms that follow zeros. The first rule gives 0100... when applied
to 00100100... (the selected terms are those at positions 0, 2, 4, 6). The second rule
gives 0110... when applied to 00101100...

Mises gave no exact definition of admissible selection rule (at that time the 
theory of algorithms did not exist yet). Later Church suggested the following 
formal definition of admissible selection rule. 

An admissible selection rule is a total computable function $S$ defined on finite
strings that has values 1 (“select”) and 0 (“do not select”). To apply $S$ to a sequence
$x_0 x_1 x_2 \ldots$ we select all $x_n$ such that $S(x_0 x_1 \ldots x_{n-1}) = 1$. Selected terms form
a subsequence (finite or infinite). Therefore, each selection rule $S$ determines a
mapping $\sigma_S : \Omega \to \Sigma$, where $\Sigma$ is the set of all finite and infinite sequences of zeros
and ones.

For example, if $S(x) = 1$ for every string $x$, then $\sigma_S$ is an identity mapping.
Therefore, the first requirement in the Mises approach follows from the second one,
and we come to the following definition:

A sequence $x = x_0 x_1 x_2 \ldots$ is Mises-Church random, if for every admissible
selection rule $S$ the sequence $\sigma_S(x)$ is either finite or has limit frequency 1/2.

Church’s definition of admissible selection rules has the following motivation. 
Imagine you come to a casino and watch the outcomes of coin tossing. Then you 
decide whether to participate in the next game or not, applying S to the sequence 
of observed outcomes. 
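Church's definition translates directly into code when restricted to finite prefixes. In the Python sketch below, `apply_selection_rule`, `even_indices` and `after_zero` are hypothetical names; the two rules are examples (a) and (b) from above, and the inputs are the two sample sequences used there.

```python
def apply_selection_rule(S, bits):
    """Return the terms x_n of the finite string `bits` with S(x_0 ... x_{n-1}) == 1."""
    return "".join(bits[n] for n in range(len(bits)) if S(bits[:n]) == 1)


def even_indices(prefix):
    """Rule (a): select the next term iff its index (the prefix length) is even."""
    return 1 if len(prefix) % 2 == 0 else 0


def after_zero(prefix):
    """Rule (b): select the next term iff the previous term was 0."""
    return 1 if prefix.endswith("0") else 0


print(apply_selection_rule(even_indices, "00100100"))
print(apply_selection_rule(after_zero, "00101100"))
```

The decision about $x_n$ depends only on the earlier terms, which is the formal counterpart of deciding whether to bet before seeing the next outcome.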


31 Selection rules and Martin-Löf randomness

Theorem 20. Applying an admissible selection rule (according to Church's definition)
to a Martin-Löf random sequence, we get either a finite sequence or a
Martin-Löf random sequence.

Proof. Let $S$ be a function that determines the selection rule $\sigma_S$.

Let $\Sigma_x$ be the set of all finite or infinite sequences that have prefix $x$ (here $x$ is
a finite binary string).

Consider the set $A_x = \sigma_S^{-1}(\Sigma_x)$ of all (infinite) sequences $\omega$ such that the selected
subsequence starts with $x$. If $x = \Lambda$ (empty string), then $A_x = \Omega$.

Lemma. The set $A_x$ has measure at most $2^{-|x|}$.

Proof. What is $A_0$? In other terms, what is the set of all sequences $\omega$
such that the selected subsequence (according to the selection rule $\sigma_S$)
starts with 0? Consider the set $B$ of all strings $z$ such that $S(z) = 1$ but
$S(z') = 0$ for each proper prefix $z'$ of $z$. These strings mark the places
where the first bet is made. Therefore,

$A_0 = \bigcup \{\Omega_{z0} \mid z \in B\}$

and

$A_1 = \bigcup \{\Omega_{z1} \mid z \in B\}$.

In particular, the sets $A_0$ and $A_1$ have the same measure and are disjoint,
therefore

$P(A_0) = P(A_1) \le 1/2$.

From the probability theory viewpoint, $P(A_0)$ [resp., $P(A_1)$] is the
probability of the event “the first selected term will be 0 [resp. 1]”, and both
events have the same probability (that does not exceed 1/2) for evident reasons.

We can prove in the same way that $A_{00}$ and $A_{01}$ have the same measure.
(See the details below.) Since they are disjoint subsets of $A_0$, both of them
have measure at most 1/4. The sets $A_{10}$ and $A_{11}$ also have equal measure
and are subsets of $A_1$, therefore both have measure at most 1/4, etc.

If this does not sound convincing, let us give an explicit description of
$A_{00}$. Let $B_0$ be the set of all strings $z$ such that

(1) $S(z) = 1$;

(2) there exists exactly one proper prefix $z'$ of $z$ such that $S(z') = 1$;

(3) $z'0$ is a prefix of $z$.

In other terms, $B_0$ corresponds to the positions where we are making our
second bet while our first bet produced 0. Then

$A_{00} = \bigcup \{\Omega_{z0} \mid z \in B_0\}$

and

$A_{01} = \bigcup \{\Omega_{z1} \mid z \in B_0\}$.

Therefore $A_{00}$ and $A_{01}$ indeed have equal measures.

Lemma is proven. 

□ 

It is also clear that $A_x$ is a union of intervals $\Omega_y$ that can be
effectively generated if $x$ is known. (Here we use the computability of $S$.)

To prove Theorem 20, assume that $\sigma_S(\omega)$ is an infinite non-random
sequence. Then $\{\sigma_S(\omega)\}$ is an effectively null singleton. Therefore,
for each $\varepsilon$ one can effectively generate intervals
$\Omega_{x_1}, \Omega_{x_2}, \ldots$ whose union covers $\sigma_S(\omega)$ and
whose total measure is less than $\varepsilon$. The preimages

$\sigma_S^{-1}(\Sigma_{x_1}), \sigma_S^{-1}(\Sigma_{x_2}), \ldots$

cover $\omega$. Each of these preimages is an enumerable union of intervals, and
by the lemma the preimage of $\Sigma_{x_i}$ has measure at most $2^{-|x_i|}$; if
we combine all these intervals, we get a covering for $\omega$ that has measure
less than $\varepsilon$. Thus, $\omega$ is non-random, so Theorem 20 is proven. □

Theorem 21. Every Martin-Löf random sequence has limit frequency 1/2.

Proof. By definition this means that the set $\neg\mathrm{SLLN}$ of all sequences
that do not satisfy SLLN (the Strong Law of Large Numbers) is an effectively null
set. As we have mentioned, this is a null set and the proof relies on an upper
bound for binomial coefficients. This upper bound is explicit, and the argument
showing that the set $\neg\mathrm{SLLN}$ is a null set can be extended to show
that $\neg\mathrm{SLLN}$ is an effectively null set. □

Combining these two results, we get the following 

Theorem 22. Every Martin-Löf random sequence is also Mises-Church random.


Problems 

1. The following selection rule is not admissible according to the Mises
definition: choose all terms $x_{2n}$ such that $x_{2n+1} = 0$. Show that
(nevertheless) it gives a (Martin-Löf) random sequence if applied to a Martin-Löf
random sequence.

2. Let $x_0 x_1 x_2 \ldots$ be a Mises-Church random sequence. Let
$a_N = |\{n < N \mid x_n = 0, x_{n+1} = 1\}|$. Prove that $a_N/N \to 1/4$ as
$N \to \infty$.


32 Probabilistic machines 

Consider a Turing machine that has access to a source of random bits. Imagine,
for example, that it has some special states $a, b, c$ with the following
property: when the machine reaches state $a$, it jumps at the next step to one of
the states $b$ and $c$ with probability 1/2 for each.

Another approach: consider a program in some language that allows assignments

a := random;

where random is a keyword and a is a Boolean variable that gets value 0 or 1
when this statement is executed (with probability 1/2; each new random bit is
independent of the previous ones).

For a deterministic machine, the output is a function of its input. Now this is
not the case: for a given input the machine can produce different outputs, and
each output has some probability. So for each input the output is a random
variable. What can be said about this variable? We will consider machines without
inputs; each machine of this type determines a random variable (its output).

Let $M$ be a machine without input. (For example, $M$ can be a Turing machine
that is put to work on an empty tape, or a Pascal program that does not have read
statements.) Now consider the probability $p$ of the event “M terminates”. What
can be said about this number?

More formally, for each sequence $\omega \in \Omega$ we consider the behavior of
$M$ if random bits are taken from $\omega$. For a given $\omega$ the machine
either terminates or not. Then $p$ is the measure of the set $T$ of all $\omega$
such that $M$ terminates using $\omega$. It is easy to see that $T$ is
measurable. Indeed, $T$ is the union of the sets $T_n$, where $T_n$ is the set of
all $\omega$ such that $M$ stops after at most $n$ steps using $\omega$. Each
$T_n$ is a union of intervals $\Omega_t$ for some strings $t$ of length at most
$n$ (the machine can use at most $n$ random bits if it runs in time $n$) and
therefore is measurable; the union of all $T_n$ is an open (and therefore
measurable) set.

A real number $p$ is called enumerable from below or lower semicomputable if
$p$ is the limit of an increasing computable sequence of rational numbers:
$p = \lim p_i$, where $p_0 \le p_1 \le p_2 \le \ldots$ and there is an algorithm
that computes $p_i$ given $i$.

Lemma. A real number $p$ is lower semicomputable if and only if the set
$X_p = \{r \in \mathbb{Q} \mid r < p\}$ is (computably) enumerable.

Proof. (1) Let $p$ be the limit of a computable increasing sequence $p_i$. For
every rational number $r$ we have

$r < p \Leftrightarrow \exists i\, [r < p_i]$.

Let $r_0, r_1, \ldots$ be a computable sequence of rational numbers such that
every rational number appears infinitely often in this sequence. The following
algorithm enumerates $X_p$: at the $i$th step, compare $r_i$ and $p_i$; if
$r_i < p_i$, output $r_i$.

(2) If $X_p$ is computably enumerable, let $r_0, r_1, r_2, \ldots$ be its
enumeration. Then $p_n = \max(r_0, r_1, \ldots, r_n)$ is a non-decreasing
computable sequence of rational numbers that converges to $p$. □
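
Both directions of this proof can be run on a finite number of steps. The Python
sketch below is only an illustration: the function names, the particular way of
listing rationals with repetition, and the example sequence
$p_i = 1/2 - 2^{-(i+1)}$ are assumptions made here, not part of the text.

```python
from fractions import Fraction
from itertools import count, islice

def rationals_with_repetition():
    """Yield rationals so that every rational number appears infinitely often."""
    for bound in count(1):
        for q in range(1, bound + 1):
            for num in range(-bound * q, bound * q + 1):
                yield Fraction(num, q)

def enumerate_below(p, steps):
    """Part (1): at step i compare r_i with p(i) and output r_i if r_i < p(i)."""
    rs = islice(rationals_with_repetition(), steps)
    return [r for i, r in enumerate(rs) if r < p(i)]

def running_max(enumeration):
    """Part (2): p_n = max(r_0, ..., r_n) for an enumeration r_0, r_1, ..."""
    best = None
    for r in enumeration:
        best = r if best is None else max(best, r)
        yield best

# Hypothetical example: p_i = 1/2 - 2^-(i+1), converging to p = 1/2 from below.
p = lambda i: Fraction(1, 2) - Fraction(1, 2 ** (i + 1))
found = enumerate_below(p, 200)
approximations = list(running_max(found))
```

Repeating every rational infinitely often matters: a rational $r < p$ that is
rejected at an early step (because $p_i$ is still too small) gets another chance
later.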

Theorem 23. (a) Let $M$ be a probabilistic machine without input. Then $M$'s
probability of termination is lower semicomputable.

(b) Let $p$ be a lower semicomputable number in $[0,1]$. Then there exists a
probabilistic machine that terminates with probability $p$.





Proof. (a) Let $M$ be a probabilistic machine. Let $p_n$ be the probability that
$M$ terminates after at most $n$ steps. The number $p_n$ is a rational number
with denominator $2^n$ that can be effectively computed for a given $n$. (Indeed,
the machine $M$ can use at most $n$ random bits during $n$ steps. For each of the
$2^n$ binary strings of length $n$ we simulate the behavior of $M$ and see for how
many of them $M$ terminates.) The sequence $p_0, p_1, p_2, \ldots$ is an
increasing computable sequence of rational numbers that converges to $p$.

(b) Let $p$ be a real number in $[0,1]$ that is lower semicomputable. Let
$p_0 \le p_1 \le p_2 \le \ldots$ be an increasing computable sequence that
converges to $p$. Consider the following probabilistic machine. It treats random
bits $b_0, b_1, b_2, \ldots$ as binary digits of a real number

$\beta = 0.b_0 b_1 b_2 \ldots$

When $i$ random bits are generated, we have lower and upper bounds for $\beta$
that differ by $2^{-i}$. If the upper bound $\beta_i$ turns out to be less than
$p_i$, the machine terminates. It is easy to see that the machine terminates for
given $\beta = 0.b_0 b_1 \ldots$ if and only if $\beta < p$. Indeed, if an upper
bound for $\beta$ is less than a lower bound for $p$, then $\beta < p$. On the
other hand, if $\beta < p$, then $\beta_i < p_i$ for some $i$ (since
$\beta_i \to \beta$ and $p_i \to p$ as $i \to \infty$).

□ 
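
The machine from part (b) can be simulated exactly with rational arithmetic. The
sketch below is a toy under stated assumptions: the names `run_machine` and
`termination_probability` are made up here, and the approximating sequence is
passed in as a Python function `p` returning `Fraction` values.

```python
from fractions import Fraction
from itertools import product

def run_machine(bits, p):
    """Return how many random bits were read before termination, or None."""
    lower = Fraction(0)                      # lower bound for beta
    for i, b in enumerate(bits, start=1):
        lower += Fraction(b, 2 ** i)
        upper = lower + Fraction(1, 2 ** i)  # upper bound for beta after i bits
        if upper < p(i):                     # beta is certainly below p
            return i
    return None

def termination_probability(n, p):
    """Fraction of n-bit strings on which the machine stops within n bits."""
    stops = sum(run_machine(bits, p) is not None
                for bits in product((0, 1), repeat=n))
    return Fraction(stops, 2 ** n)

# Hypothetical target p = 1/3 with the constant approximation p_i = 1/3.
one_third = lambda i: Fraction(1, 3)
print(termination_probability(4, one_third))
```

For every $n$ the value computed this way stays below $p$ and approaches it as
$n$ grows, which mirrors part (a) applied to the machine built in part (b).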

Now we consider probabilities of different outputs. Here we need the following
definition: A sequence $p_0, p_1, p_2, \ldots$ of real numbers is lower
semicomputable if there is a computable total function $p$ of two variables (that
range over natural numbers) with rational values (with the special value
$-\infty$ added) such that

$p(i,0) \le p(i,1) \le p(i,2) \le \ldots$

and

$p_i = \lim_n p(i,n)$

for every $i$.

Lemma. A sequence $p_0, p_1, p_2, \ldots$ of reals is lower semicomputable if and
only if the set of pairs

$\{(i, r) \mid r < p_i\}$

is enumerable.

Proof. Let $p_0, p_1, \ldots$ be lower semicomputable and
$p_i = \lim_n p(i,n)$. Then

$r < p_i \Leftrightarrow \exists n\, [r < p(i,n)]$

and we can check $r < p(i,n)$ for all pairs $(i,r)$ and for all $n$. If
$r < p(i,n)$, the pair $(i,r)$ is included in the enumeration.

On the other hand, if the set of pairs is enumerable, for each $n$ we let
$p(i,n)$ be the maximum value of $r$ for all pairs $(i,r)$ (with given $i$) that
appear during $n$ steps of the enumeration process. (If there are no pairs,
$p(i,n) = -\infty$.) The lemma is proven.

□ 

Theorem 24. (a) Let $M$ be a probabilistic machine without input that can produce
natural numbers as outputs. Let $p_i$ be the probability of the event “M
terminates with output i”. Then the sequence $p_0, p_1, \ldots$ is lower
semicomputable and $\sum_i p_i \le 1$.

(b) Let $p_0, p_1, p_2, \ldots$ be a sequence of non-negative real numbers that is
lower semicomputable, and $\sum_i p_i \le 1$. Then there exists a probabilistic
machine $M$ that outputs $i$ with probability (exactly) $p_i$.

Proof. Part (a) is similar to the previous argument: let $p(i,n)$ be the
probability that $M$ terminates with output $i$ after at most $n$ steps. Then
$p(i,0), p(i,1), \ldots$ is a computable sequence of increasing rational numbers
that converges to $p_i$.

(b) is more complicated. Recall the proof of the previous theorem. There we
had a “random real” $\beta$ and a “termination region” $[0,p)$ where $p$ was the
desired termination probability. (If $\beta$ is in the termination region, the
machine terminates.)

Now the termination region is divided into parts. For each output value $i$ there
is a part of the termination region that corresponds to $i$ and has measure
$p_i$. The machine terminates with output $i$ if and only if $\beta$ is inside
the $i$th part.

Let us consider first a special case when the sequence $p_i$ is a computable
sequence of rational numbers. Then the $i$th part is a segment of length $p_i$.
These segments are allocated from left to right according to “requests” $p_i$.
One can say that each number $i$ comes with a request $p_i$ for space allocation,
and this request is granted. Since we can compute the endpoints of all segments,
and have lower and upper bounds for $\beta$, we are able to detect the moment
when $\beta$ is guaranteed to be inside the $i$th part.

In the general case the construction should be modified. Now each $i$ comes
to the space allocator many times with increasing requests
$p(i,0), p(i,1), p(i,2), \ldots$; each time the request is granted by allocating
an additional interval of length $p(i,n) - p(i,n-1)$. Note that now the $i$th
part is not contiguous: it consists of infinitely many segments separated by
other parts. But this is not important. The machine terminates with output $i$
when the current lower and upper bounds for $\beta$ guarantee that $\beta$ is
inside the $i$th part. The interior of the $i$th part is a countable union of
intervals, and if $\beta$ is inside this open set, the machine will terminate
with output $i$. Therefore, the termination probability is the measure of this
set, i.e., equals $\lim_n p(i,n)$. □
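
The special case of rational requests is easy to simulate. The following Python
sketch handles only a finite list of rational requests; the name `output_for` and
the example requests are assumptions made for illustration, and the general case
with growing requests $p(i,n)$ is not implemented here.

```python
from fractions import Fraction

def output_for(bits, requests):
    """Return i once the bounds for beta lie inside the i-th segment, else None."""
    # Segment i is [left_i, left_i + requests[i]], allocated from left to right.
    lefts = [sum(requests[:i], Fraction(0)) for i in range(len(requests))]
    lower = Fraction(0)
    for n, b in enumerate(bits, start=1):
        lower += Fraction(int(b), 2 ** n)
        upper = lower + Fraction(1, 2 ** n)
        for i, (left, size) in enumerate(zip(lefts, requests)):
            # Endpoints are shared only on a set of measure zero,
            # so closed segments give the same probabilities.
            if left <= lower and upper <= left + size:
                return i
    return None

# Hypothetical requests: output 0 with probability 1/2, output 1 with 1/4.
requests = [Fraction(1, 2), Fraction(1, 4)]
print(output_for("0", requests), output_for("10", requests))
```

With these requests the remaining quarter $[3/4, 1]$ belongs to no output, so a
machine reading bits that start with 11 never terminates.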


Problems 

1. A probabilistic machine without input terminates for all possible coin tosses
(there is no sequence of coin tosses that leads to infinite computation). Prove
that the computation time is bounded by some constant (and the machine can
produce only a finite number of outputs).

2. Let $p_i$ be the probability of termination with output $i$ for some
probabilistic machine and $\sum_i p_i = 1$. Prove that all $p_i$ are computable,
i.e., for every given $i$ and for every rational $\varepsilon > 0$ we can find
(algorithmically) an approximation to $p_i$ with absolute error at most
$\varepsilon$.


33 A priori probability 

A sequence of real numbers $p_0, p_1, p_2, \ldots$ is called a lower
semicomputable semimeasure if there exists a probabilistic machine (without
input) that produces $i$ with probability $p_i$. (As we know,
$p_0, p_1, \ldots$ is a lower semicomputable semimeasure if and only if the
sequence $p_i$ is lower semicomputable and $\sum_i p_i \le 1$.)

Theorem 25. There exists a maximal lower semicomputable semimeasure $m$
(maximality means that for every lower semicomputable semimeasure $m'$ there
exists a constant $c$ such that $m'(i) \le c\,m(i)$ for all $i$).

Proof. Let $M_0, M_1, \ldots$ be a sequence of all probabilistic machines without
input. Let $M$ be a machine that starts by choosing a natural number $i$ at
random (so that each outcome has positive probability) and then emulates $M_i$.
If $p_i$ is the probability that $i$ is chosen, $m$ is the distribution on the
outputs of $M$ and $m'$ is the distribution on the outputs of $M_i$, then
$m(x) \ge p_i m'(x)$ for all $x$. □
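
The inequality $m(x) \ge p_i m'(x)$ is a general property of mixtures and can be
checked on a toy example. The sketch below mixes finitely many semimeasures given
as explicit tables; this is an assumption made for illustration only, since the
actual construction ranges over all probabilistic machines. The weights
$2^{-(i+1)}$ are one possible choice of positive probabilities.

```python
from fractions import Fraction

def mixture(semimeasures):
    """Mix semimeasures, choosing the i-th one with probability 2^-(i+1)."""
    m = {}
    for i, m_i in enumerate(semimeasures):
        weight = Fraction(1, 2 ** (i + 1))
        for x, prob in m_i.items():
            m[x] = m.get(x, Fraction(0)) + weight * prob
    return m

# Two hypothetical semimeasures on natural numbers (each sums to at most 1).
toy = [
    {0: Fraction(1, 2), 1: Fraction(1, 4)},
    {1: Fraction(1, 3), 2: Fraction(2, 3)},
]
m = mixture(toy)
print(m)
```

Each table is dominated by the mixture up to the factor $2^{i+1}$, which plays
the role of the constant $c$ in the theorem.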

The maximal lower semicomputable semimeasure is called a priori probability.
This name can be explained as follows. Imagine that we have a black box that
can be turned on and prints a natural number. We have no information about what
is inside. Nevertheless we have an “a priori” upper bound for the probability of
the event “i appears” (up to a constant factor that depends on the box but not on
$i$).

The same definition can be used for real-valued functions on strings instead of
natural numbers (probabilistic machines produce strings; the sum $\sum_x p(x)$ is
taken over all strings $x$, etc.); in this way we may define discrete a priori
probability on binary strings. (There is another notion of a priori probability
for strings, called continuous a priori probability, but we do not consider it in
this survey.)


34 Prefix decompression 

The a priori probability is related to a special complexity measure called prefix
complexity. The idea is that the description is self-delimited; the decompression
program has to decide for itself where to stop reading input. There are different
versions of machines with self-delimiting input; we choose one that is
technically convenient, though maybe not the most natural one.

A computable function $f$ whose inputs are binary strings is called a prefix
function if for every string $x$ and its proper prefix $y$ at least one of the
values $f(x)$ and $f(y)$ is undefined. (So a prefix function cannot be defined
both on a string and its proper prefix or continuation.)

Theorem 26. There exists a prefix decompressor $D$ that is optimal among prefix
decompressors: for each computable prefix function $D'$ there exists some
constant $c$ such that

$C_D(x) \le C_{D'}(x) + c$

for all $x$.

Proof. To prove a similar result for plain Kolmogorov complexity we used

$D(\bar{p}01y) = p(y)$

where $\bar{p}$ is a program $p$ with doubled bits and $p(y)$ stands for the
output of program $p$ with input $y$. This $D$ is a prefix function if and only
if all programs compute prefix functions. We cannot algorithmically distinguish
between prefix and non-prefix programs (this is an undecidable problem). However,
we may convert each program into a prefix one in such a way that prefix programs
remain unchanged. Let us explain how this can be done.

Let

$D(\bar{p}01y) = [p](y)$

where $[p](y)$ is computed as follows. We apply $p$ in parallel to all inputs and
get a sequence of pairs $(y_i, z_i)$ such that $p(y_i) = z_i$. Select a “prefix”
subsequence by deleting all $(y_i, z_i)$ such that $y_i$ is a prefix of $y_j$ or
$y_j$ is a prefix of $y_i$ for some $j < i$. This process does not depend on $y$.
To compute $[p](y)$, wait until $y$ appears in the selected subsequence, i.e.,
$y = y_i$ for a selected pair $(y_i, z_i)$, and then output $z_i$.

The function $y \mapsto [p](y)$ is a prefix function for every $p$, and if
program $p$ computes a prefix function, then $[p](y) = p(y)$.

Therefore, $D$ is an optimal prefix decompression algorithm. □
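
The selection step in this proof is a simple filter over an enumeration. The
Python sketch below applies it to a finite list of pairs; the names `compatible`
and `prefix_filter` and the sample pairs are assumptions for illustration, and a
real enumeration of the graph of $p$ would be infinite and produced by running
$p$ on all inputs in parallel.

```python
def compatible(a: str, b: str) -> bool:
    """Two strings are compatible if one of them is a prefix of the other."""
    return a.startswith(b) or b.startswith(a)

def prefix_filter(pairs):
    """Keep (y_i, z_i) unless y_i is compatible with some earlier y_j, j < i."""
    seen = []                      # all earlier inputs, selected or deleted
    for y, z in pairs:
        if not any(compatible(y, earlier) for earlier in seen):
            yield (y, z)
        seen.append(y)

# Hypothetical enumeration of pairs (input, output) of some program p.
pairs = [("0", "a"), ("01", "b"), ("1", "c"), ("011", "d")]
print(list(prefix_filter(pairs)))
```

If the inputs in the enumeration are already pairwise incompatible, nothing is
deleted, which corresponds to the claim that prefix programs remain unchanged.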

Complexity with respect to an optimal prefix decompression algorithm is called 
prefix complexity and denoted by K(x). 


35 Prefix complexity and length 

As we know, $C(x) \le |x| + O(1)$ (consider the identity mapping as a
decompression algorithm). But the identity mapping is not a prefix one, so we
cannot use this argument to show that $K(x) \le |x| + O(1)$, and in fact this is
not true, as the following theorem shows.

Theorem 27.

$\sum_x 2^{-K(x)} \le 1$.

Proof. For every $x$ let $p_x$ be the shortest description for $x$ (with respect
to the given prefix decompression algorithm). Then $|p_x| = K(x)$ and all strings
$p_x$ are incompatible. (We say that $p$ and $q$ are compatible if $p$ is a
prefix of $q$ or vice versa.) Therefore, the intervals $\Omega_{p_x}$ are
disjoint; they have measure $2^{-|p_x|} = 2^{-K(x)}$, so the sum does not
exceed 1. □

If $K(x) \le |x| + O(1)$ were true, then $\sum_x 2^{-|x|}$ would be finite, but
it is not the case (for each natural number $n$ the sum over strings of length
$n$ equals 1). However, we can prove weaker upper bounds:

Theorem 28.

$K(x) \le 2|x| + O(1)$;

$K(x) \le |x| + 2\log|x| + O(1)$;

$K(x) \le |x| + \log|x| + 2\log\log|x| + O(1)$.

Proof. The first bound is obtained if we use $D(\bar{x}01) = x$, where $\bar{x}$
is $x$ with doubled bits. (It is easy to check that $D$ is a prefix function.)
The second one uses

$D(\overline{\mathrm{bin}(|x|)}\,01\,x) = x$

where $\mathrm{bin}(|x|)$ is the binary representation of the length of string
$x$. Iterating this trick, we let

$D(\overline{\mathrm{bin}(|\mathrm{bin}(|x|)|)}\,01\,\mathrm{bin}(|x|)\,x) = x$

and get the third bound, etc. □
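
The three decompressors from this proof can be written out directly. The Python
sketch below is an illustration; the function names are chosen here and are not
notation from the text. It encodes a bit string in each of the three
self-delimiting ways and decodes it back.

```python
def double(s: str) -> str:
    """Repeat every bit twice: 101 -> 110011."""
    return "".join(ch + ch for ch in s)

def encode1(x: str) -> str:
    return double(x) + "01"

def encode2(x: str) -> str:
    return double(format(len(x), "b")) + "01" + x

def encode3(x: str) -> str:
    n_bits = format(len(x), "b")
    return double(format(len(n_bits), "b")) + "01" + n_bits + x

def read_doubled(code: str):
    """Read doubled bits up to the separator 01; return (decoded bits, rest)."""
    out, i = [], 0
    while code[i:i + 2] in ("00", "11"):
        out.append(code[i])
        i += 2
    if code[i:i + 2] != "01":
        raise ValueError("not a valid code word")
    return "".join(out), code[i + 2:]

def decode1(code: str) -> str:
    x, rest = read_doubled(code)
    if rest:
        raise ValueError("not a valid code word")
    return x

def decode2(code: str) -> str:
    n_bits, x = read_doubled(code)
    if not n_bits or len(x) != int(n_bits, 2):
        raise ValueError("not a valid code word")
    return x

def decode3(code: str) -> str:
    m_bits, rest = read_doubled(code)
    if not m_bits:
        raise ValueError("not a valid code word")
    m = int(m_bits, 2)
    n_bits, x = rest[:m], rest[m:]
    if m == 0 or len(n_bits) != m or len(x) != int(n_bits, 2):
        raise ValueError("not a valid code word")
    return x

print(encode1("101"), encode2("101"), encode3("101"))
```

The code lengths are $2|x| + 2$ for the first encoding and
$|x| + 2|\mathrm{bin}(|x|)| + 2$ for the second, matching the first two bounds
of the theorem.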

Let us note that prefix complexity does not increase when we apply an algorithmic
transformation: $K(A(x)) \le K(x) + O(1)$ for every algorithm $A$ (the constant
in $O(1)$ depends on $A$). Let us take an optimal decompressor (for plain
complexity) as $A$. We conclude that $K(x)$ does not exceed $K(p) + O(1)$ if $p$
is a description of $x$. Combining this with the theorem above, we conclude that
$K(x) \le 2C(x) + O(1)$, that $K(x) \le C(x) + 2\log C(x) + O(1)$, etc.

In particular, the difference between plain and prefix complexity for $n$-bit
strings is $O(\log n)$.


36 A priori probability and prefix complexity 

We now have two measures for a string (or natural number) $x$. The a priori
probability $m(x)$ measures how probable it is to see $x$ as an output of a
probabilistic machine. Prefix complexity measures how difficult it is to specify
$x$ in a self-delimiting way. It turns out that these two measures are closely
related.

Theorem 29.

$K(x) = -\log m(x) + O(1)$

(Here $m(x)$ is a priori probability; log stands for binary logarithm.)

Proof. The function $K$ is enumerable from above; therefore,
$x \mapsto 2^{-K(x)}$ is lower semicomputable. Also we know that
$\sum_x 2^{-K(x)} \le 1$, therefore $2^{-K(x)}$ is a lower semicomputable
semimeasure. Therefore, $2^{-K(x)} \le c\,m(x)$ and
$K(x) \ge -\log m(x) + O(1)$. To prove that $K(x) \le -\log m(x) + O(1)$, we need
the following lemma about memory allocation.

Let the memory space be represented by $[0,1]$. Each memory request asks for a
segment of length 1, 1/2, 1/4, 1/8, etc. that is properly aligned. Alignment
means that for a segment of length $1/2^k$ only $2^k$ positions are allowed
($[0, 2^{-k}]$, $[2^{-k}, 2 \cdot 2^{-k}]$, etc.). Allocated segments should be
disjoint (common endpoints are allowed). Memory is never freed.

Lemma. For each computable sequence of requests $2^{-n_i}$ such that
$\sum_i 2^{-n_i} \le 1$ there is a computable sequence of allocations that grant
all requests.

Proof. We keep a list of free space divided into segments of size $2^{-k}$.
Invariant relation: all segments are properly aligned and have different sizes.
Initially there is one free segment of length 1. When a new request of length $w$
comes, we pick up the smallest segment of length at least $w$. This strategy is
sometimes called the “best fit” strategy. (Note that if the free list contains
only segments of length $w/2, w/4, \ldots$, then the total free space is less
than $w$, so this cannot happen by our assumption.) If the smallest free segment
of length at least $w$ has length $w$, we simply allocate it (and delete it from
the free list). If it has length $w' > w$, then we split $w'$ into parts of size
$w, w, 2w, 4w, \ldots, w'/4, w'/2$ and allocate the left $w$-segment, putting all
others in the free list, so the invariant is maintained. □
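
The best-fit strategy from this proof fits in a few lines. In the Python sketch
below (the name `allocate` and the representation of the free list as a
dictionary are assumptions made here), a request $2^{-n}$ is given by its
exponent $n$, and each granted segment is reported as the binary string of
length $n$ that labels it.

```python
from fractions import Fraction

def allocate(exponents):
    """Grant requests 2^-n for n in `exponents`; return the allocated strings."""
    # Free list: exponent k -> left endpoint of the free aligned segment of
    # length 2^-k. By the invariant there is at most one segment of each size.
    free = {0: Fraction(0)}
    result = []
    for n in exponents:
        fitting = [k for k in free if k <= n]   # segments of length >= 2^-n
        if not fitting:
            raise ValueError("requests exceed total memory 1")
        k = max(fitting)                        # the smallest fitting segment
        left = free.pop(k)
        # Split: keep the leftmost piece of size 2^-n; the rest becomes free
        # segments of sizes 2^-n, 2^-(n-1), ..., 2^-(k+1).
        for j in range(n, k, -1):
            free[j] = left + Fraction(1, 2 ** j)
        index = int(left * 2 ** n)              # position among aligned slots
        result.append(format(index, "b").zfill(n) if n else "")
    return result

print(allocate([1, 2, 3, 3]))
```

The returned strings are pairwise incompatible, since aligned segments that are
disjoint correspond to strings none of which is a prefix of another.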

Reformulation of the lemma: ... there is a computable sequence of incompatible
strings $x_i$ such that $|x_i| = n_i$. (Indeed, an aligned segment of size
$2^{-n}$ is $I_x$ for some string $x$ of length $n$.)

Corollary. For each computable sequence of requests $2^{-n_i}$ such that
$\sum_i 2^{-n_i} \le 1$ we have $K(i) \le n_i + O(1)$.

(Indeed, consider a decompressor that maps $x_i$ to $i$. Since all $x_i$ are
pairwise incompatible, it is a prefix function.)

Now we return to the proof. Since $m$ is lower semicomputable, there exists a
non-negative function $M : (x,k) \mapsto M(x,k)$ of two arguments with rational
values that is non-decreasing with respect to the second argument such that
$\lim_k M(x,k) = m(x)$.

Let $M'(x,k)$ be the smallest number in the sequence
$1, 1/2, 1/4, 1/8, \ldots, 0$ that is greater than or equal to $M(x,k)$. It is
easy to see that $M'(x,k) \le 2M(x,k)$ and that $M'$ is monotone.

We call a pair $(x,k)$ “essential” if $k = 0$ or $M'(x,k) > M'(x,k-1)$. The sum
of $M'(x,k)$ for all essential pairs with given $x$ is at most twice its biggest
term (because each term is at least twice the preceding one), and its biggest
term is at most twice $M(x,k)$ for some $k$. Since $M(x,k) \le m(x)$ and
$\sum_x m(x) \le 1$, we conclude that the sum of $M'(x,k)$ for all essential
pairs $(x,k)$ does not exceed 4.

Let $(x_i, k_i)$ be a computable sequence of all essential pairs. (We enumerate
all pairs and select essential ones.) Let $n_i$ be an integer such that
$2^{-n_i} = M'(x_i,k_i)/4$. Then $\sum_i 2^{-n_i} \le 1$.

Therefore, $K(i) \le n_i + O(1)$. Since $x_i$ is obtained from $i$ by an
algorithm, we conclude that $K(x_i) \le n_i + O(1)$ for all $i$. For a given $x$
one can find $i$ such that $x_i = x$ and $2^{-n_i} \ge m(x)/4$, so
$n_i \le -\log m(x) + 2$ and $K(x) \le -\log m(x) + O(1)$. □

37 Prefix complexity of a pair 

We can define $K(x,y)$ as the prefix complexity of some code $[x,y]$ of the pair
$(x,y)$. As usual, different computable encodings give complexities that differ
at most by $O(1)$.

Theorem 30.

$K(x,y) \le K(x) + K(y) + O(1)$.

Note that now we do not need the $O(\log n)$ term that was necessary for plain
complexity.

Proof. Let us give two proofs of this theorem, using prefix functions and a
priori probability.

(1) Let $D$ be the optimal prefix decompressor used in the definition of $K$.
Consider a function $D'$ such that

$D'(pq) = [D(p), D(q)]$

for all strings $p$ and $q$ such that $D(p)$ and $D(q)$ are defined. Let us prove
that this definition makes sense, i.e., that it does not lead to conflicts. A
conflict happens if $pq = p'q'$ and $D(p), D(q), D(p'), D(q')$ are defined. But
then $p$ and $p'$ are prefixes of the same string and are compatible, so $D(p)$
and $D(p')$ cannot be defined at the same time unless $p = p'$ (which implies
$q = q'$).

Let us check that $D'$ is a prefix function. Indeed, if it is defined for $pq$
and $p'q'$, and at the same time $pq$ is a prefix of $p'q'$, then (as we have
seen) $p$ and $p'$ are compatible and (since $D(p)$ and $D(p')$ are defined)
$p = p'$. Then $q$ is a prefix of $q'$, so $D(q)$ and $D(q')$ cannot be defined
at the same time unless $q = q'$.

The function $D'$ is computable (for given $x$ we try all decompositions
$x = pq$ in parallel). So we have a prefix algorithm $D'$ such that
$C_{D'}([x,y]) \le K(x) + K(y)$ and therefore $K(x,y) \le K(x) + K(y) + O(1)$.
(End of the first proof.)

(2) In terms of a priori probability we have to prove that

$m([x,y]) \ge \varepsilon\, m(x) m(y)$

for some positive $\varepsilon$ and all $x$ and $y$. Consider the function $m'$
determined by the equation

$m'([x,y]) = m(x) m(y)$

($m'$ is zero for inputs that do not encode pairs of strings). We have

$\sum_z m'(z) = \sum_{x,y} m'([x,y]) = \sum_{x,y} m(x) m(y) = \sum_x m(x) \sum_y m(y) \le 1 \cdot 1 = 1$.

The function $m'$ is lower semicomputable, so $m'$ is a lower semicomputable
semimeasure. Therefore, it is bounded by the maximal semimeasure (up to a
constant factor). □
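
The function $D'$ from the first proof can be illustrated with a toy prefix
decompressor given as a finite table. The table `TOY_D` and the name `d_prime`
are assumptions made for this sketch; a real $D$ is a partial computable
function, so “trying all decompositions in parallel” would require dovetailing
rather than table lookups. The pair $[x,y]$ is represented here by a Python
tuple.

```python
# Hypothetical prefix decompressor: its domain {0, 10, 11} is prefix-free.
TOY_D = {"0": "a", "10": "b", "11": "c"}

def d_prime(code: str, D=TOY_D):
    """D'(pq) = [D(p), D(q)]: try all decompositions code = p q."""
    results = {(D[code[:k]], D[code[k:]])
               for k in range(len(code) + 1)
               if code[:k] in D and code[k:] in D}
    # At most one decomposition works when the domain of D is prefix-free.
    assert len(results) <= 1
    return results.pop() if results else None

print(d_prime("010"), d_prime("1011"))
```

No separator is needed between $p$ and $q$: the first description is
self-delimiting, so the split point is determined by the code itself.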

A similar (but a bit more complicated) argument shows the equality

$K(x,y) = K(x) + K(y \mid x, K(x)) + O(1)$.


38 Prefix complexity and randomness 

Theorem 31. A sequence $x_0 x_1 x_2 \ldots$ is Martin-Löf random if and only if
there exists some constant $c$ such that

$K(x_0 x_1 \ldots x_{n-1}) \ge n - c$

for all $n$.

Proof. We have to prove that the sequence $x_0 x_1 x_2 \ldots$ is not random if
and only if for every $c$ there exists $n$ such that

$K(x_0 x_1 \ldots x_{n-1}) < n - c$.

(If-part) A string $u$ is called (for this proof) $c$-defective if
$K(u) < |u| - c$. We have to prove that the set of all sequences that have a
$c$-defective prefix for all $c$ is an effectively null set. It is enough to
prove that the set of all sequences that have a $c$-defective prefix for a given
$c$ can be covered by intervals with total measure $2^{-c}$.

Note that the set of all $c$-defective strings is enumerable (since $K$ is
enumerable from above). It remains to show that the sum $\sum 2^{-|u|}$ over all
$c$-defective $u$ does not exceed $2^{-c}$. Indeed, if $u$ is $c$-defective, then
by definition $2^{-|u|} \le 2^{-c} 2^{-K(u)}$. On the other hand, the sum of
$2^{-K(u)}$ over all $u$ (and therefore over defective $u$) does not exceed 1.
(Only-if-part) Let $N$ be the set of all non-random sequences. $N$ is an
effectively null set. For each positive integer $c$ consider a sequence of
intervals

$\Omega_{u(c,0)}, \Omega_{u(c,1)}, \Omega_{u(c,2)}, \ldots$

that cover $N$ and have total measure at most $2^{-2c}$. The definition of
effectively null sets guarantees that such a sequence exists (and its elements
can be effectively generated when $c$ is given).

For each $c, i$ consider the integer $n(c,i) = |u(c,i)| - c$. For a given $c$ the
sum $\sum_i 2^{-n(c,i)}$ does not exceed $2^{-c}$ (because the sum
$\sum_i 2^{-|u(c,i)|}$ does not exceed $2^{-2c}$). Therefore the sum
$\sum_{c,i} 2^{-n(c,i)}$ over all $c$ and $i$ does not exceed 1.

We would like to consider a semimeasure $M$ such that
$M(u(c,i)) = 2^{-n(c,i)}$; however, it may happen that $u(c,i)$ coincide for
different pairs $c, i$. In this case we add the corresponding values, so the
precise definition is

$M(x) = \sum \{2^{-n(c,i)} \mid u(c,i) = x\}$.

Note that $M$ is lower semicomputable, since $u$ and $n$ are computable
functions. Therefore, if $m$ is the universal semimeasure, we have
$m(x) \ge \varepsilon M(x)$, so $K(x) \le -\log M(x) + O(1)$, and
$K(u(c,i)) \le n(c,i) + O(1) = |u(c,i)| - c + O(1)$.

If some sequence $x_0 x_1 x_2 \ldots$ belongs to the set $N$ of non-random
sequences, then it has prefixes of the form $u(c,i)$ for all $c$, and for these
prefixes the difference between length and $K$ is not bounded. □

39 Strong law of large numbers revisited 

Let $p, q$ be positive rational numbers such that $p + q = 1$. Consider the
following semimeasure: a string $x$ of length $n$ with $k$ ones and $l$ zeros has
probability

$\mu(x) = \frac{c}{n^2} p^k q^l$

where the constant $c$ is chosen in such a way that $\sum_n c/n^2 \le 1$. It is
indeed a semimeasure (the sum over all strings $x$ is at most 1, because the sum
of $\mu(x)$ over all strings $x$ of given length $n$ is $c/n^2$; $p^k q^l$ is the
probability to get string $x$ for a biased coin whose sides have probabilities
$p$ and $q$).

Therefore, we conclude that $\mu(x)$ is bounded by a priori probability (up to a
constant) and we get an upper bound

$K(x) \le 2\log n + k(-\log p) + l(-\log q) + O(1)$

for fixed $p$ and $q$ and for an arbitrary string $x$ of length $n$ that has $k$
ones and $l$ zeros. If $p = q = 1/2$, we get the bound
$K(x) \le n + 2\log n + O(1)$ that we already know. The new bound is biased: if
$p > 1/2$ and $q < 1/2$, then $-\log p < 1$ and $-\log q > 1$, so we count ones
with less weight than zeros, and the new bound can be better for strings that
have many ones and few zeros.

Assume that $p > 1/2$ and the fraction of ones in $x$ is greater than $p$. Then
our bound implies

$K(x) \le 2\log n + np(-\log p) + nq(-\log q) + O(1)$

(more ones make our bound only tighter). It can be rewritten as

$K(x) \le nH(p,q) + 2\log n + O(1)$

where $H(p,q)$ is the Shannon entropy of the two-valued distribution with
probabilities $p$ and $q$:

$H(p,q) = -p\log p - q\log q$.

Since $p + q = 1$, we have a function of one variable:

$H(p) = H(p, 1-p) = -p\log p - (1-p)\log(1-p)$.

This function has a maximum at 1/2; it is easy to check using derivatives that
$H(p) = 1$ when $p = 1/2$ and $H(p) < 1$ when $p \ne 1/2$.
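
A quick numeric check makes the gain visible. The Python sketch below evaluates
the entropy function and the main terms of the biased bound in floating point;
the names `H` and `biased_bound` and the sample values ($n = 1000$, $k = 900$,
$p = 0.9$) are assumptions chosen for illustration, and the $O(1)$ term is
ignored.

```python
from math import log2

def H(p: float) -> float:
    """Shannon entropy -p log p - (1 - p) log(1 - p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def biased_bound(n: int, k: int, p: float) -> float:
    """Main terms 2 log n + k(-log p) + l(-log q) of the bound, without O(1)."""
    l, q = n - k, 1 - p
    return 2 * log2(n) + k * -log2(p) + l * -log2(q)

# Hypothetical string: 1000 bits, 900 of them ones, measured with p = 0.9.
print(H(0.5), H(0.9), biased_bound(1000, 900, 0.9))
```

For a string with 90% ones the main term is about $nH(0.9)$, less than half of
the trivial $n$, while for $p = 1/2$ the bound reduces to $n + 2\log n$.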

Corollary. For every $p > 1/2$ there exist a constant $\alpha < 1$ and a constant
$c$ such that

$K(x) < \alpha n + 2\log n + c$

for each string $x$ of length $n$ where the frequency of 1s is at least $p$.

Therefore, for every $p > 1/2$, an infinite sequence of zeros and ones that has
infinitely many prefixes with frequency of ones at least $p$ is not Martin-Löf
random. This gives us a proof of a constructive version of the Strong Law of
Large Numbers:

Theorem 32. Every Martin-Löf random sequence $x_0 x_1 x_2 \ldots$ of zeros and
ones is balanced:

$\lim_{n\to\infty} \frac{x_0 + x_1 + \ldots + x_{n-1}}{n} = \frac{1}{2}$.
Problems

1. Let $D$ be a prefix decompression algorithm. Give a direct construction of a
probabilistic machine that outputs $i$ with probability at least $2^{-K_D(i)}$.

2*. Prove that $K(x) \le C(x) + K(C(x))$.

3. Prove that there exists an infinite sequence $x_0 x_1 \ldots$ and a constant
$c$ such that

$C(x_0 x_1 \ldots x_{n-1}) \ge n - 2\log n + c$

for all $n$.


40 Hausdorff dimension 

Let $\alpha$ be a positive real number. A set $X \subset \Omega$ of infinite bit
sequences is called $\alpha$-null if for every $\varepsilon > 0$ there exists a
set of strings $u_0, u_1, u_2, \ldots$ such that

(1) $X \subset \Omega_{u_0} \cup \Omega_{u_1} \cup \Omega_{u_2} \cup \ldots$;

(2) $\sum_i 2^{-\alpha|u_i|} < \varepsilon$.

In other terms, we modify the definition of a null set: instead of the uniform
measure $P(\Omega_u) = 2^{-|u|}$ of an interval $\Omega_u$ we consider its
$\alpha$-size $(P(\Omega_u))^\alpha = 2^{-\alpha|u|}$. For $\alpha > 1$ we get a
trivial notion: all sets are $\alpha$-null (one can cover the entire $\Omega$ by
$2^N$ intervals of size $2^{-N}$, and
$2^N \cdot 2^{-\alpha N} = 1/2^{(\alpha-1)N}$ is small for large $N$). For
$\alpha = 1$ we get the usual notion of null sets, and for $\alpha < 1$ we get a
smaller class of sets (the smaller $\alpha$ is, the stronger condition we get).

For a given set $X \subset \Omega$ consider the infimum of $\alpha$ such that $X$
is an $\alpha$-null set. This infimum is called the Hausdorff dimension of $X$.
As we have seen, for the subsets of $\Omega$ the Hausdorff dimension is at most 1.

This is a classical notion but it can be constructivized in the same way as
for null sets. A set $X \subset \Omega$ of infinite bit sequences is called
effectively $\alpha$-null if there is an algorithm that, given a rational
$\varepsilon > 0$, enumerates a sequence of strings $u_0, u_1, u_2, \ldots$
satisfying (1) and (2). The following result extends Theorem 19.

Theorem 33. Let $\alpha > 0$ be a rational number. Then there exists an
effectively $\alpha$-null set $N$ that contains every effectively $\alpha$-null
set.

Proof. We can use the same argument as for Theorem 19: since $\alpha$ is
rational, we can compute the $\alpha$-sizes of intervals with arbitrary
precision, and this is enough to ensure that the sum of $\alpha$-sizes of a
finite set of intervals is less than $\varepsilon$. (The same argument works for
every computable $\alpha$.) □
Now we define the effective Hausdorff dimension of a set $X \subset \Omega$ as
the infimum of $\alpha$ such that $X$ is an effectively $\alpha$-null set. It is
easy to see that we may consider only rational $\alpha$ in this definition. The
effective Hausdorff dimension cannot be smaller than the (classical) Hausdorff
dimension, but may be bigger (see below).

We define the effective Hausdorff dimension of a point $\chi \in \Omega$ as the
effective Hausdorff dimension of the singleton $\{\chi\}$. Note that there is no
classical counterpart of this notion, since every singleton has Hausdorff
dimension 0.

For effectively null sets we have seen that this property of the set was
essentially the property of its elements (all elements should be non-random); a
similar result is true for effective Hausdorff dimension.

Theorem 34. For every set $X$ its effective Hausdorff dimension equals the
supremum of effective Hausdorff dimensions of its elements.

Proof. Evidently, the dimension of an element of $X$ cannot exceed the dimension
of the set $X$ itself. On the other hand, if for some rational $\alpha > 0$ all
elements of $X$ have effective dimension less than $\alpha$, they all belong to
the maximal effectively $\alpha$-null set, so $X$ is a subset of this maximal
set, so $X$ is an effectively $\alpha$-null set, and the effective dimension of
$X$ does not exceed $\alpha$. □

The criterion of Martin-Löf randomness in terms of complexity (Theorem 31)
also has its counterpart for effective dimension. The previous result
(Theorem 34) shows that it is enough to characterize the effective dimension of
singletons, and this can be done:

Theorem 35. The effective Hausdorff dimension of a sequence $\chi = x_0 x_1 x_2 \ldots$ is
equal to

$$\liminf_{n \to \infty} \frac{K(x_0 x_1 \ldots x_{n-1})}{n}.$$

In this statement we use prefix complexity, but one may use the plain complexity
instead (since the difference is at most $O(\log n)$ for $n$-bit strings).
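
The ratio in Theorem 35 cannot be computed, since $K$ is not computable, but its behaviour can be illustrated with a real compressor. The sketch below is a rough experiment under loud assumptions: `zlib` output length is used as a crude stand-in for $K$ (it is neither a lower bound nor close to optimal), the minimum over finitely many prefixes stands in for the liminf, and the function names are hypothetical.

```python
import os
import zlib


def compressed_ratio(data: bytes) -> float:
    """(compressed length) / (original length): a crude stand-in for K(prefix)/n."""
    return len(zlib.compress(data, 9)) / len(data)


def min_ratio_over_prefixes(data: bytes, start: int, step: int) -> float:
    """Minimum ratio over prefixes of length start, start+step, ...

    This is only a finite stand-in for the liminf over all prefix lengths.
    """
    return min(
        compressed_ratio(data[:n])
        for n in range(start, len(data) + 1, step)
    )


if __name__ == "__main__":
    zeros = bytes(20000)          # highly regular: ratio close to 0
    noise = os.urandom(20000)     # operating-system randomness: ratio close to 1
    print(min_ratio_over_prefixes(zeros, 1000, 1000))
    print(min_ratio_over_prefixes(noise, 1000, 1000))
```

The two extremes match the intuition behind the theorem: a sequence of zeros has dimension 0, while typical random data is incompressible.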

Proof. If the liminf is smaller than $\alpha$, then $K(u) < \alpha|u|$ for infinitely many prefixes
$u$ of $\chi$. For the strings $u$ with this property we have

$$2^{-\alpha|u|} \le m(u),$$

where $m$ is the a priori probability, and the sum of $m(u)$ over all $u$ is bounded by 1. So
we get a family of intervals that cover $\chi$ infinitely many times and have the sum of
$\alpha$-sizes bounded by 1. If we (1) increase $\alpha$ a bit and consider some $\alpha' > \alpha$, and
(2) consider only strings $u$ of length greater than some large $N$, we get a family of
intervals that cover $\chi$ and have small sum of $\alpha'$-sizes (bounded by $2^{(\alpha-\alpha')N}$, to be
exact). This family is enumerable, since $\alpha$ is rational and $K$ is upper semicomputable.
This argument shows that the effective Hausdorff dimension of $\chi$ does not exceed
the liminf.
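
The bound on the sum of $\alpha'$-sizes can be checked in one line. In the following sketch, $S$ is a name introduced here for the set of all strings $u$ with $K(u) < \alpha|u|$ and $|u| > N$, so that the sum of $2^{-\alpha|u|}$ over $S$ is at most the sum of $m(u)$, which is bounded by 1:

```latex
\sum_{u \in S} 2^{-\alpha'|u|}
  = \sum_{u \in S} 2^{-\alpha|u|} \cdot 2^{-(\alpha'-\alpha)|u|}
  \le 2^{-(\alpha'-\alpha)N} \sum_{u \in S} 2^{-\alpha|u|}
  \le 2^{(\alpha-\alpha')N}.
```

Since $\alpha' > \alpha$, the right-hand side tends to 0 as $N$ grows, so the sum can be made smaller than any given $\varepsilon > 0$.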

It remains to prove the reverse inequality. Assume that $\chi$ has effective Hausdorff
dimension less than some (rational) $\alpha$. Then we can effectively cover $\chi$
by a family of intervals with arbitrarily small sum of $\alpha$-sizes. Combining the
covers with sum bounded by $1/2, 1/4, 1/8, \ldots$, we get a computable sequence
$u_0, u_1, u_2, \ldots$ such that

(1) intervals $\Omega_{u_0}, \Omega_{u_1}, \Omega_{u_2}, \ldots$ cover $\chi$ infinitely many times;

(2) $\sum_i 2^{-\alpha|u_i|} \le 1$.

The second inequality implies that $K(i) \le \alpha|u_i| + O(1)$, and therefore $K(u_i) \le
K(i) + O(1) \le \alpha|u_i| + O(1)$. Since $\chi$ has infinitely many prefixes among the $u_i$, we
conclude that our liminf is bounded by $\alpha$. □
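
The step from (2) to the bound on $K(i)$ relies on the maximality of the a priori probability $m$ and on its connection with prefix complexity. A sketch, where $\mu$ is a name introduced here for illustration and $c > 0$ is a constant that depends on the sequence $u_0, u_1, \ldots$ but not on $i$:

```latex
\mu(i) = 2^{-\alpha|u_i|}, \qquad \sum_i \mu(i) \le 1
\;\Longrightarrow\; m(i) \ge c\,\mu(i)
\;\Longrightarrow\; K(i) = -\log m(i) + O(1) \le \alpha|u_i| + O(1).
```

Here $\mu$ is computable because the sequence $u_i$ is computable and $\alpha$ is rational, so $\mu$ is a lower semicomputable semimeasure. The inequality $K(u_i) \le K(i) + O(1)$ holds because $u_i$ can be computed from $i$.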

This theorem implies that Martin-Löf random sequences have dimension 1 (it
is also a direct consequence of the definition); it also allows us to construct easily
a sequence of dimension $\alpha$ for arbitrary $\alpha \in (0,1)$ (by adding incompressible
strings to increase the complexity of the prefix and strings of zeros to decrease it
when needed).
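
The mixing idea can be sketched in code. This is a simplified illustration under stated assumptions, not the construction itself: pseudo-random bits from Python's `random` module stand in for incompressible strings (they are in fact compressible, given the seed), the decision is made bit by bit rather than block by block, and the function name `mixed_sequence` is hypothetical. The sketch only tracks how many "random" positions each prefix contains and keeps that number within 1 of $\alpha$ times the prefix length.

```python
import random
from fractions import Fraction


def mixed_sequence(alpha, n, seed=0):
    """Return (bits, counts) for an n-bit string.

    counts[L-1] is the number of pseudo-random positions among the first L bits.
    alpha should be a Fraction in (0, 1] so that all comparisons are exact.
    """
    rng = random.Random(seed)
    bits, counts = [], []
    random_count = 0
    for length in range(1, n + 1):
        if random_count < alpha * length:
            # the prefix is "too simple": spend one pseudo-random bit
            bits.append(str(rng.randint(0, 1)))
            random_count += 1
        else:
            # the prefix is "complex enough": pad with a zero
            bits.append("0")
        counts.append(random_count)
    return "".join(bits), counts


if __name__ == "__main__":
    bits, counts = mixed_sequence(Fraction(1, 3), 30)
    print(bits)
    print(counts[-1], "pseudo-random positions out of", len(bits))
```

With truly incompressible material in place of the pseudo-random bits, one would expect the complexity of the length-$L$ prefix to be close to $\alpha L$, which is what Theorem 35 needs.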


41 Problems 


1. Let $k_n$ be the average complexity of binary strings of length $n$:

$$k_n = \Big[ \sum_{|x|=n} C(x) \Big] / 2^n.$$

Prove that $k_n = n + O(1)$ (i.e., $|k_n - n| < c$ for some $c$ and all $n$).

2. Prove that for a Martin-Löf random sequence $a_0 a_1 a_2 a_3 \ldots$ the set of all $i$
such that $a_i = 1$ is not enumerable (there is no program that generates elements of
this set).

3. (Continued) Prove the same result for Mises-Church random sequences. 

4. String $x = yz$ of length $2n$ is incompressible: $C(x) \ge 2n$; strings $y$ and $z$
have length $n$. Prove that $C(y), C(z) \ge n - O(\log n)$. Can you improve this bound
and show that $C(y), C(z) \ge n - O(1)$?

5. (Continued) Is the reverse statement (if $y$ and $z$ are incompressible, then
$C(yz) = 2n + O(\log n)$) true?
6. Prove that if $C(y|z) \ge n$ and $C(z|y) \ge n$ for strings $y$ and $z$ of length $n$, then
$C(yz) \ge 2n - O(\log n)$.

7. Prove that if $x$ and $y$ are strings of length $n$ and $C(xy) \ge 2n$, then the length
of every common subsequence $u$ of $x$ and $y$ does not exceed $0.99n$. (A string $u$ is a
subsequence of a string $v$ if $u$ can be obtained from $v$ by deleting some terms. For
example, 111 is a subsequence of 010101, but 1110 and 1111 are not.)

8. Let $a_0 a_1 a_2 \ldots$ and $b_0 b_1 b_2 \ldots$ be Martin-Löf random sequences and $c_0 c_1 c_2 \ldots$
be a computable sequence. Can the sequence $(a_0 \oplus b_0)(a_1 \oplus b_1)(a_2 \oplus b_2) \ldots$ be
non-random? (Here $a \oplus b$ denotes $a + b \bmod 2$.) The same question for
$(a_0 \oplus c_0)(a_1 \oplus c_1)(a_2 \oplus c_2) \ldots$

9. True or false: $C(x,y) \le K(x) + C(y) + O(1)$?

10. Prove that for every $c$ there exists $x$ such that $K(x) - C(x) > c$.

11. Let $m(x)$ be the a priori probability of string $x$. Prove that the binary
representation of the real number $\sum_x m(x)$ is a Martin-Löf random sequence.

12. Prove that $C(x) + C(x,y,z) \le C(x,y) + C(x,z) + O(\log n)$ for strings $x, y, z$
of length at most $n$.

13. (Continued) Prove a similar result for prefix complexity with $O(1)$ instead
of $O(\log n)$.
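
The subsequence relation defined in Problem 7 can be checked mechanically, which may help when experimenting with small cases. A minimal sketch, with the hypothetical function name `is_subsequence`:

```python
def is_subsequence(u: str, v: str) -> bool:
    """True if u can be obtained from v by deleting some terms."""
    remaining = iter(v)
    # each "ch in remaining" consumes v up to and including the first match,
    # so the characters of u are matched strictly from left to right
    return all(ch in remaining for ch in u)


if __name__ == "__main__":
    # the examples given in Problem 7
    print(is_subsequence("111", "010101"))
    print(is_subsequence("1110", "010101"))
    print(is_subsequence("1111", "010101"))
```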